ABSTRACT

Studies on the criterion validity of student
evaluation-of-instruction instruments are analyzed, and
recommendations are offered for future research into student
evaluation of instruction. The main problem, and probably the reason
for the lack of validity studies, is that it is difficult to agree on
what the criteria of effective teaching should be. One method of
dealing with the problems of research in student
evaluation-of-instruction instruments is to select a measurable
definition of teaching effectiveness. Since the ultimate criterion of
teaching effectiveness is student learning, there is general
agreement that an appropriate and defensible criterion is the amount
that students learn as measured by achievement examinations.
Attention is directed to: studies incorporating achievement scores
and random assignment; studies incorporating achievement scores
adjusted for ability; studies incorporating achievement scores not
adjusted for ability; and a meta-analysis of student ratings and
student achievement. Studies using criterion measures in conjunction
with achievement and studies using criterion measures other than
achievement are also reviewed. Tables are presented to summarize the
studies that examined the relationship of student ratings of
instruction and criterion measures. Although parallel data were not
reported in all the studies, the table shows the largest significant
correlation reported in each study. These largest correlations are
squared to indicate the proportion of variance shared by the
criterion and the student ratings. The majority of the investigations
reported significant positive correlations between student ratings of
instruction and criterion measures of effective teaching; however,
the correlations between the ratings and criteria were usually modest.
A bibliography is appended. (SW)


Foreword



A great number of factors are pushing colleges and universities to examine
the ways they evaluate their faculty. Enrollment changes and an immobile
professoriate mean that institutions must tighten up their tenure and
promotion policies. Limited financial resources constrain their ability to
award merit raises and other perquisites. Changing patterns of student
enrollment force them to consider terminating selected programs or
faculty members. And the increase in litigation of personnel issues demands
that they have definite policies and procedures for making such decisions.
Finally, many institutions view evaluation of teaching as a way of helping
their faculty develop skills, rather than only as a rating mechanism.

One of the primary methods colleges have used to evaluate faculty has
been the questionnaire in which students rate the instruction they have
received in their classes. In recent years, a large number of studies have
looked at the criterion validity of student evaluation-of-instruction
instruments. In this Research Report, Sidney E. Benton, professor of
education at North Georgia College, analyzes these studies and presents a
number of their problems and weaknesses, as well as their strengths. In
doing so, he also makes recommendations for future research into student
evaluation of instruction and how it can more properly serve the purposes
it is designed to accomplish.

Jonathan D. Fife
Director

Clearinghouse on Higher Education
The George Washington University


Overview

Let such teach others who themselves excel,
And censure freely who have written well.

Alexander Pope

An Essay on Criticism

A number of procedures are used to measure instruction in higher
education. These include evaluation by colleagues, appraisals by the dean or
department head, evaluations by means of audio or video tapes, appraisals
of the instructor's course material by a faculty committee, and student
evaluations of instructors.

Of these procedures the use of student evaluations has gained the most
support. Writers have pointed out that these evaluations are made by
those who have actually experienced the teaching. Student
evaluation-of-instruction instruments are widely used and written about.

These student ratings have been used primarily to improve instruction
and to make decisions about faculty tenure, promotion, and merit pay.
The basic assumption behind this use is that such ratings provide evidence
of quality teaching.

Many faculty members, however, criticize the use of student rating
forms, especially in matters of tenure, promotion, and pay. Faculty
resistance to the use of these forms stems from the fact that many rating
forms have been prepared by groups or individuals who merely sat down and
developed items that in their judgment had face validity with respect to
measuring effective teaching behaviors. Repeatedly college instructors
point out that insufficient attention has been given to criterion validity
checks. Criterion validity is perhaps best defined as "the extent to which
test performance is related to some other valued measure of performance"
(Gronlund 1981, p. 72). In this case, "test performance" is the students'
ratings of their instructor on a student evaluation-of-instruction
instrument. The "valued measure of performance" is some other measure of the
instructor's teaching effectiveness. These other measures typically have
been students' scores on a course examination, student gain scores,
students' scores on national examinations, students' interest in advanced
courses, ratings of video tape clips, and ratings by trained observers. If
the effectiveness of an instructor is to be evaluated in any part by student
evaluation-of-instruction instruments, it should be important to examine
this relationship between the results of the ratings on such instruments
and good teaching performance as indicated by other measures deemed
to be valid.

Since the ultimate criterion of teaching effectiveness is student
learning, there is general agreement that an appropriate and defensible
criterion is the amount that students learn as measured by achievement
examinations. The majority of criterion-validity studies reviewed involved
the use of such examinations for establishing criterion validity. One of the
problems in such studies involves finding courses that have a large number
of sections with a common examination. Such requirements are necessary
to avoid statistical and research design problems. Statisticians generally
agree that relationships based on a small number of course sections are
apt to be unstable.

When courses with a large number of sections are located, it is not
easy for the researcher to ensure that the students in all the sections have
the same aptitude at the beginning of the courses. If the sections do not
consist of students with this equal aptitude at the beginning of the course,
the researcher must assume that any differences extant at the end of the
course might have resulted from the initial differences rather than from
the differential effects of the teaching. Random assignment to sections
provides the best assurance that groups are equal at the onset of a study.
However, in many situations the researcher cannot make such arbitrary
assignments. When randomization is not possible, many researchers have
statistically adjusted for ability. Some have ignored that such differences
may exist. Other researchers have had students select sections without
knowledge of who the instructor will be, thus reducing a possible
systematic bias in the selection process and giving some assurance that the
groups are equal at the beginning of the study.

Other investigations have involved measures other than course
examinations in establishing criterion validity. These studies are reviewed
and discussed in a second section of this monograph.

It should be noted that a number of very weak studies are discussed
in this monograph with the weaknesses delineated. There are two reasons
for including these weak studies. In the first place, they shed some light
on the subject at hand, even though the data base is weak. To ignore these
studies would be to eliminate some important information. It has been
pointed out:

A common method of integrating several studies with inconsistent findings
is to carp on the design or analysis deficiencies of all but a few studies
. . . and then advance the one or two "acceptable" studies as the truth of
the matter. This approach takes design and analysis too seriously. . . . To
integrate research results by eliminating the "poorly done" studies is to
discard a vast amount of important data. (Glass 1976, p. 4)

Secondly, these weak studies are often cited in books and articles on
student evaluation of instruction without drawing attention to their
weaknesses. Frequently in reviews of the literature of individual articles the
statistical results of related pieces of research are summarized in a
sentence or two, but the limitations of the research are not mentioned. Thus,
these "findings" become incorporated into the mainstream of education
thought and practice without their legitimacy being questioned.
Specifically, some weak studies are being used inappropriately by colleges and
universities to justify or reject the use of specific instruments or student
evaluation-of-instruction instruments in general.

A review of the literature suggests four major observations. The first
of these observations on criterion validity studies is that the majority of
the investigations reported significant, but modest, positive correlations
between student ratings and criterion measures held to be indications of
effective teaching. The synthesis of the findings of the studies indicates
that student evaluations are tapping into an important dimension of
teaching. Therefore, there is a legitimate basis in using them to evaluate the
performance of college teachers.

A second major observation is that, overall, the findings are highly
inconsistent, with a range of significant correlations reported between
-.75 and .96. However, since the correlations are not highly positive, it
must be recognized that there is a great deal more to instruction than is
accurately reflected in student evaluations. Such evaluations should be
an important part of an overall assessment of an instructor's teaching
performance, but it would appear that an administrator or committee
that makes decisions about a professor's teaching based on student
evaluations alone is on shaky ground indeed.

There are at least eight possible reasons for the inconsistent results
reported in the various studies. These include: small sample sizes, a
diversity of the types of courses using the evaluation forms, the number of
types of evaluation forms used, the failure to distinguish who was being
evaluated (teaching assistants or full-time professors), a lack of
standardized procedures for the administration of forms, the use of criterion
measures with unknown psychometric properties, a lack of adequate control
for initial ability of students in various course sections, and differences in
the times during the course the evaluation forms were administered. It
seems reasonable to expect that in practice student evaluations would
parallel the research. Those who use student evaluations must realize that
such evaluations will vary a great deal, according to whether the class is
large or small, whether the course is of one type or another (basic or
advanced, theoretical or practical, elective or required, etc.), whether the
types of evaluation forms fit the types of instructional procedures used,
and whether the instructor is a teaching assistant or full-time professor.
Variations can also be expected when the procedures for the
administration of the instrument are not standardized, when the psychometric
qualities of the instrument itself are lacking, when the students are of differing
abilities and attitudes, and when the instrument is administered at
different points during the course.

A third major observation is that there is an identifiable trend in the
frequency with which certain student rating variables emerge as
significant predictors of effective teaching. Although an overall evaluation of
instruction item or an overall score is often listed as an indicator of
effective teaching, neither is generally useful to instructors as an aid for
improving their teaching. The two specific categories or factors that emerge
most often in studies as significant predictors of effective teaching relate
to the skill of the instructor and organization and planning. Instructors
and evaluation committees should, therefore, pay particular attention to
their ratings on items that reflect these two factors.

The fourth major observation from the review of the literature is that
a definite need still exists for more studies of student
evaluation-of-instruction instruments. With the increasing demand for objective data in
the evaluation of professors, there is every reason to expect that these
instruments will continue to be used. If so, it is in the best interests of
higher education in this country that we learn more about these
instruments so that they can be used more fairly and justly.



The Uses of Student Evaluation Instruments 

A number of procedures are used to evaluate college instruction. These
include ratings by colleagues, appraisals by deans or department heads,
evaluations by means of audio or video tapes, appraisals of the course
material by faculty committees, and evaluations by students. The central
purpose of this monograph is to examine studies that relate to student
evaluations and to make suggestions for the use of these evaluations as
well as suggestions for future studies of student evaluations. First,
however, the other procedures used to evaluate college instruction mentioned
above will be discussed briefly.

The evaluation of an instructor's performance by colleagues and
administrators has been criticized. In such evaluations:

those raters seldom have observed the individual in the classroom.
Therefore they base their ratings of his teaching on his performance in rather
different situations and/or on statements made by some of his students.
These students may or may not be a representative sample of the teacher's
classes. Further, the sample may or may not be comparable from one
teacher to another. (Voeks 1962, p. 212)



There is a danger that in these evaluations the rater will "screen the
teacher's performance too much through his own selective perceptions of
what constitutes good teaching" (Miller 1974, p. 31). This caution is
applicable to classroom visitations by superiors and colleagues and to the
use of audio and video tapes.

Evaluation of the instructor's course material also has its failings. It
is easy for an instructor to get together an impressive syllabus, an array
of objectives, and a list of readings for an appraisal committee. This set
of materials may bear little relationship to what goes on in the actual
teaching situation. Although many who produce such materials are also
good teachers, there is no guarantee that these materials accurately
represent a good teacher.

In recent years the use of student evaluations of instructors has gained
much support. "Of several procedures, the student instructional rating
approach is apparently being marketed most vigorously" (Frey 1973a, p.
3). "With increased demand for more careful assessment of teaching,
administrators are incorporating student ratings of instructional
effectiveness into their personnel decisions" (Sheehan 1975, p. 687). Another position
is that "student ratings constitute one of the most credible indicators of
professorial performance available" (Scott 1975).

It also has been pointed out that student evaluations of instructor
effectiveness are made by those who have actually experienced the
teaching. "Students are the only persons who see the teacher day after day in
the classroom. They are not experts on how to teach, but they can furnish
valuable evidence concerning the way their teachers teach" (Hayes 1963,
p. 168). Both the importance and the usefulness of the opinions of
students concerning their instructors have been emphasized by a number of
sources.

What seems to be most lacking in current practices is carefully
accumulated information about a teacher's actual performance. Student
opinion is of particular importance here because it represents an important
addition to the data customarily used to judge faculty competence. It is
the one source of direct and extensive observations of the way teachers
carry out their daily and long-range tasks. (Eble 1971, p. 14)

Student evaluations of instructors have three major uses: to help
institutions make decisions about faculty tenure and promotion, to help
students select courses or instructors, and to provide information that
instructors can use in changing their courses or teaching methods (Centra
1980; Blount, Gupta, and Stallings 1976). The need for evaluation of
teaching definitely exists if for no other reason than to improve teaching
performance.

It has been suggested that many faculty members do use the ratings
for purposes of course improvement and self-improvement (Romine 1973).
"Many faculty regard student evaluation of their courses as an indication
of their teaching success, and may actually allow the results to shape their
subsequent pedagogical behavior" (Bausell and Magoon 1972, p. 1013).

Many faculty members criticize the use of student rating forms.
However, after conducting one of the most comprehensive reviews of the
empirical studies pertinent to these criticisms, Costin, Greenough, and Menges
(1971) concluded that these ratings can provide reliable and valid
information on the quality of courses and instruction. However, they point out
that "faculty resistance to the use of student rating forms may stem
partially from the fact that many rating forms have been prepared by groups
or individuals not qualified to construct such instruments" (p. 511). This
claim seems founded. According to Miller, "Too many procedures for
evaluation consider only the first step, the development of evaluative criteria"
(1974, p. 15).

In the past, many of these forms were constructed by people who merely
sat down and developed items that in their judgment had face validity
with respect to measuring effective teaching behaviors. In many instances,
insufficient attention was given to the rationale for devising items, to
revision of the items, and to reliability and criterion validity checks. Many
student rating forms are considerably lacking in attention to
predetermined criteria as a basis of their construction (Costin, Greenough, and
Menges 1971; Miller 1974).

Today is the "Age of Litigation" for institutions of higher education.
Those who make decisions about faculty salaries, tenure, and promotions
have to be able to produce evidence to support their decisions. In the
search for data that can be so employed, they have frequently mandated
the use of student evaluation-of-instruction instruments. Those who use,
or require the use of, such instruments often know little about the
development or the psychometric properties of the instruments they choose.
Since such critical decisions are affected by the use of these instruments,
it is important to learn as much about them as possible.



Selection of a Student Evaluation-of-Instruction Instrument
There are a great number of reports in the literature on instruments for
student evaluation of instruction and their use. Pettman's (1972) annotated
bibliography on student evaluation articles published between 1965 and
1970 listed 107 articles. Biddle's (1980) annotated bibliography,
incorporating the ERIC files dating between 1976 and 1978, listed
approximately 280 items.

An extensive search of the literature to identify the best and most
widely usable student evaluation-of-instruction instruments was made by
Benton (1974). His search was guided by three predetermined criteria:
(1) the instruments had to be applicable to the various academic areas,
not specific to one area (such as psychology), (2) the instruments had to
be designed to evaluate college and university teaching, and (3) the
instruments had to have been designed to provide information that could
be used to improve instruction. Only 39 instruments, of the hundreds
reported, were located that met even these very basic criteria. Since 1974,
a number of other instruments have appeared in the literature, but only
a limited number of them meet these criteria.

Considering all the checklists and other forms available for students'
evaluations of instruction and the amount of literature available on the
topic, it is understandable that instructors, faculty committees, and
administrators find it difficult to select one instrument in which they can
have confidence.



Problems of Criterion Validity 

One criterion suggested in selecting an instrument for student evaluation
of college instruction was that "validity, beyond simple content validity,
has been substantiated" (Benton 1979, p. 15). The type of validity
appropriate in this case is called criterion validity, sometimes referred to as
empirical or statistical validity. It is defined as the degree to which scores
on the instruments for student evaluation of instruction are in agreement
with some given criterion measure of effective teaching. Although it is
easy to say that student evaluation instruments should have criterion
validity clearly established, it is not easy to find studies that report such
information. Validity is one of the typical faculty concerns in the use of
such instruments (Aleamoni 1974).

To establish criterion validity of instruments of student evaluation of 
instruction, the following three steps generally are involved: 

• The instrument is administered to a group of individuals. 

• A criterion measure of effective teaching is obtained. 

• The two measures are correlated.

The resulting correlation, or validity coefficient, is an indication of the
criterion validity of the student evaluation instrument. The range of the
coefficients can be from .00 (indicating no relationship between the two
measures) to 1.00 (indicating a perfect relationship between the two
measures). The closer the correlation is to 1.00 the higher the criterion validity.
No student evaluation-of-instruction instrument is expected to have a
perfect criterion validity coefficient; therefore, predicting teaching
effectiveness based on these instruments will always be somewhat imperfect.
However, the larger the validity coefficient, the less the error in predicting
effectiveness and the more effectively the two measures reflect each other.
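
As a rough illustration of this last point, the sketch below squares a validity coefficient to obtain the proportion of variance the ratings share with the criterion, and computes sqrt(1 - r^2), the factor by which prediction error shrinks relative to simply guessing the criterion mean. That factor is a standard large-sample result and is not a formula given in this monograph. The function names are hypothetical, and the coefficients in the loop are only examples of the sizes reported in the studies reviewed here.

```python
import math


def shared_variance(r):
    """Proportion of variance shared by ratings and criterion (r squared)."""
    return r * r


def relative_prediction_error(r):
    """Standard error of estimate as a fraction of the criterion's SD.

    Large-sample approximation: sqrt(1 - r^2). A value of 1.0 means the
    ratings do no better than predicting the criterion mean for everyone.
    """
    return math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    for r in (0.39, 0.73, 0.92):  # illustrative coefficients only
        print(f"r = {r:.2f}  shared variance = {shared_variance(r):.2f}  "
              f"relative error = {relative_prediction_error(r):.2f}")
```

A coefficient of .39, for instance, corresponds to about 15 percent shared variance, so most of the variation in the criterion is left unexplained by the ratings.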

The chief problem in establishing criterion validity is the difficulty in
obtaining a satisfactory criterion measure (Thorndike and Hagen 1969).
So the main problem, and probably the reason for the lack of validity studies,
is that it is difficult to agree on what the criteria of effective teaching should
be. "Validating student ratings at the university level is difficult since
there are no clearly defined criteria of instructional quality" (Marsh,
Fleiner, and Thomas 1975, p. 833). "Validating a measure of a construct like
teaching effectiveness requires the use of many alternative criteria" (Marsh
1977, p. 442).

"Most studies of validity have used correlations with peer ratings or
supervisor ratings as the criterion" (Sullivan and Skanes 1974, p. 584).
However, what is needed is a focus on criterion validity studies that relate
to the direct outcomes of effective instruction. Therefore, as stated at the
beginning of this chapter, the central purpose of this monograph is to
examine studies that relate to these outcomes and to give suggestions
regarding the present uses of student evaluations and suggestions
regarding future studies of student evaluations.



Student Achievement and Evaluation of Instruction



One method of dealing with the problems of research in student
evaluation-of-instruction instruments is to select a measurable definition of
teaching effectiveness. Since the ultimate criterion of teaching
effectiveness is student learning, there is general agreement that an appropriate
and defensible criterion is the amount that students learn as measured
by achievement examinations.

One of the usual approaches to studies that examine the relationship
of student evaluations of instruction and student achievement is for the
researcher to select a course that has several sections taught by different
instructors but has a common examination. "In this case there is an agreed
upon, measurable, and common educational outcome which can be used
as a criterion of teacher effectiveness" (Shultz 1978, p. 15). For each section
of the course, the mean examination score is then correlated with the
mean of the students' ratings of instruction. A significant positive
correlation is held to be empirical evidence of the criterion validity of the
evaluation instrument.
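
The section-mean procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the record layout (pairs of section label and value), the function names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def section_means(records):
    """Average (section_id, value) pairs into one mean per section."""
    totals = defaultdict(lambda: [0.0, 0])
    for section, value in records:
        totals[section][0] += value
        totals[section][1] += 1
    return {s: total / n for s, (total, n) in totals.items()}


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def validity_coefficient(rating_records, exam_records):
    """Correlate section mean ratings with section mean exam scores."""
    ratings = section_means(rating_records)
    exams = section_means(exam_records)
    sections = sorted(set(ratings) & set(exams))
    return pearson_r([ratings[s] for s in sections],
                     [exams[s] for s in sections])


if __name__ == "__main__":
    # Made-up data: (section, rating on a 5-point scale), (section, exam score).
    ratings = [("A", 4), ("A", 5), ("B", 3), ("B", 3), ("C", 2), ("C", 4)]
    exams = [("A", 80), ("A", 84), ("B", 70), ("B", 74), ("C", 71), ("C", 69)]
    print(round(validity_coefficient(ratings, exams), 2))
```

Note that the unit of analysis is the section, not the student: however many students respond, the correlation rests on only as many data points as there are sections.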

There are several problems inherent in such an approach to
establishing criterion validity. One problem is that courses with a large number
of sections and a common examination are difficult to find even in large
universities. Also, there are many studies that involve a large number of
student responses but compare only a small number of instructors. These
comparisons are likely to be unstable. Even if the conditions of goodly
numbers of sections are met:

the statistical tests are generally not very powerful. With 10 different
sections, a validity coefficient would have to be .55 to reach even the .05 level
of significance. Extremely high validity coefficients cannot be expected
since performance depends upon many variables besides instructional
quality and evaluations depend upon many factors besides learning
measured by a final examination. (Marsh, Fleiner, and Thomas 1975, p. 834)
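
The .55 figure in the quotation can be checked with the usual t test for a correlation coefficient. The sketch below is an illustration only: it assumes the test is t = r * sqrt(df) / sqrt(1 - r^2) with df = n - 2, and it hard-codes critical t values for 8 degrees of freedom from a standard t table rather than computing them. The names `T_CRIT_DF8` and `critical_r` are hypothetical.

```python
from math import sqrt

# Critical values of Student's t at the .05 level for df = 8, taken from a
# standard t table (10 sections give df = 10 - 2 = 8).
T_CRIT_DF8 = {"one-tailed": 1.860, "two-tailed": 2.306}


def critical_r(t_crit, n_sections):
    """Smallest |r| that reaches significance for a given critical t.

    Obtained by solving t = r * sqrt(df) / sqrt(1 - r^2) for r.
    """
    df = n_sections - 2
    if df < 1:
        raise ValueError("need at least 3 sections")
    return t_crit / sqrt(t_crit ** 2 + df)


if __name__ == "__main__":
    for label, t_crit in T_CRIT_DF8.items():
        print(f"{label}: r >= {critical_r(t_crit, 10):.2f}")
```

Under these assumptions the one-tailed threshold comes out near .55, which agrees with the quoted figure; a two-tailed test would require roughly .63.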

Another problem is that even when courses with a large number of
sections are located, it is not easy for the researcher to ensure that the students
in the various sections have the same aptitude at the beginning of the
course.

If it cannot be demonstrated that predisposing factors such as student
ability and motivation have been equated across the different sections of
the multisection course, then it may be these variables that produce the
correlation between student ratings and exam performance. (Marsh and
Overall 1980, p. 469)

In order to compensate for these possible initial differences in aptitude,
various researchers have randomly assigned students to course sections,
statistically adjusted for initial ability, or had students select course
sections without knowledge of who the instructor was to be. Some researchers
simply have ignored the existence of differences in course sections.
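
One common way to adjust statistically for initial ability is a first-order partial correlation, which removes the part of the rating-achievement relationship that both variables share with an ability measure. The sketch below illustrates that general technique; it is not presented as the method of any particular study discussed here, and the variable labels and sample correlations are hypothetical.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant.

    x = section mean rating, y = section mean exam score,
    z = section mean on a prior-ability measure (hypothetical labels).
    """
    denom = sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0:
        raise ValueError("undefined when z correlates perfectly with x or y")
    return (r_xy - r_xz * r_yz) / denom


if __name__ == "__main__":
    # Made-up correlations: ratings and exam scores both track initial ability.
    print(round(partial_r(r_xy=0.50, r_xz=0.40, r_yz=0.60), 2))
```

When ability is unrelated to both ratings and achievement, the partial correlation equals the unadjusted one; the more both variables depend on ability, the more the adjusted coefficient shrinks.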

Studies Incorporating Achievement Scores and Random Assignment
The best method of controlling for possible initial differences among
various class sections would be to randomly assign students to the sections.
Centra (1980) suggests that "randomization of students is one of the steps
needed to draw a cause and effect relationship between rated teacher
effectiveness and student learning" (p. 37). However, researchers in college
settings rarely are able to do this. Only two studies were located in which
students were randomly assigned to the sections.

Sullivan and Skanes (1974) used 130 sections of ten courses at Memorial
University of Newfoundland, Canada. Students were randomly assigned
to sections in each of the courses. Sullivan and Skanes reported low to
moderate correlations for mean instructor ratings and mean final
examination scores for the ten courses. Of the ten correlations, eight were above
.32, and the average correlation was .39. However, only two of the ten
correlations and the average correlation were significant. The researchers
pointed out that one possible reason for the small correlations was that
the range for the two variables was restricted. The overall rating for the
instructors was based on a five-point scale, and there was little variability
in the examination scores.

One of the major strengths in the Sullivan and Skanes study, other
than the random assignment of students, was the development and scoring
of the final examination for the ten courses. Examination committees
constructed the examinations and set guidelines for grading each answer.
The examinations were scored by boards with a "small group of faculty
members marking one answer on all papers" (p. 585). The student
evaluations were done anonymously. The correlations involved only a global
rating of instructor effectiveness rather than a number of dimensions.

The study involved two different biology courses, and one course each
from physics, psychology, and science. Two of the courses had six sections,
two had eight sections, two had nine sections, and two had 14 sections.
The remaining two courses had 16 and 40 sections.

In the course that had 40 sections, the correlation between instructor
ratings and examinations was .41. When the correlation was calculated
for two subgroups, 27 full-time instructors and 13 part-time teaching
assistants (TAs), the correlation was .53 for the full-time instructors and .01
for the part-time TAs. When the amount of experience was considered for
the 27 full-time instructors, the correlation between ratings and
achievement for experienced faculty (one or more years of full-time teaching) was
.69, but for the inexperienced (those in their first year of full-time teaching)
the correlation was .13. Sullivan and Skanes thus suggest that their results
may provide some answers to some of the contradictory results of previous
studies. They further conclude, "valid ratings are much more common
and are easily obtained in the case of experienced and full-time instructors
than in the case of inexperienced or part-time instructors" (p. 587). Again
because of the size of these subgroups, the data must be regarded as
tentative. It would seem appropriate to do further research in the areas
of ratings of full-time versus part-time instructors and experienced versus
inexperienced instructors and the relationship of the achievement of the
students.

A second study in which true random assignment of students to sections
was employed was reported by Centra (1977). The study also included
sections of courses in which randomization was not used. In the Centra
study there were 72 sections of seven courses. In two of the seven courses,
a biology course and a chemistry course, students had been randomly
assigned to sections. As in the Sullivan and Skanes (1974) study, the
subjects were from Memorial University in Newfoundland. Instead of a single
global item, Centra used nine variables from the Student Instructional
Report (SIR). These variables were: "Overall Teaching Effectiveness," "Value
of Course to Student," "Teacher-Student Relationship," "Course Objective
and Organization," "Reading Assignments," "Course Difficulty and
Workload," "Examinations," "Lectures," and "Student Effort." For the two courses
in which the students had been randomly assigned to the sections, the
highest correlations with mean final examination performance were for
the area of Value of Course to Student. The correlations were reported to
be .73 and .92 for the two courses. Other significant correlations reported
were .81 (Examinations) for the biology course, and .76 (Lectures) and .79
(Student Effort) for the chemistry course.

In the Centra study almost all the instructors were experienced
teachers; none were graduate teaching assistants. The final examinations
were developed and scored as in the Sullivan and Skanes study. However,
again the results of the study must be interpreted with caution. Of the two
courses that had students randomly assigned to the sections, there were
only seven sections of each course. Regarding all the 72 sections Centra
concluded:



The pattern of correlations across the courses indicated that the global
ratings of teacher effectiveness and of the value of the course to students
were most highly related to mean exam performance (12 out of 24
product-moment and partial correlations were .58 or above). Ratings of
course objectives and organization and the quality of lecturers were also
fairly well correlated with achievement. Ratings of other aspects of
instruction, such as teacher-student relationship or the
difficulty/workload of the course, were not highly related to achievement
scores. (p. 17)



Studies Incorporating Achievement Scores Adjusted for Ability 
Since students usually know who the instructor will be when they select
their course section, it is possible that different sections could differ
markedly in student abilities and attitudes. For example, the best students
might choose the teachers with a reputation for good teaching and high
standards for students. The poorer students might select the teachers who
are less demanding and who give higher grades. A comparison of two such
sections, thus, would be contaminated by the way the students came to
be in those particular sections in the first place.

When sections are unequal in abilities and attitudes at the beginning
of a study, some statistical adjustment is necessary to account for these
initial differences. The process is usually one of computing residual
scores; that is, scores statistically adjusted for initial ability or
attitude. Researchers generally use some nationally normed aptitude test
(e.g., Scholastic Aptitude Test), a pretest in the course area, or grade
point averages to compute residual examination scores. Although the
procedure is defensible, simply adjusting scores statistically is not as
satisfactory as random assignment of the students to sections. However, in
the studies reviewed in this section, the researchers made some adjustment
for ability in a portion of their study.
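
The residual-score procedure described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: it uses a
single ability measure and a simple linear regression on section means, and
every number in it is invented rather than taken from any study reviewed
here. The names pearson_r and residual_scores are hypothetical helpers.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residual_scores(ability, exam):
    """Return exam means with the linear effect of ability removed."""
    ma, me = mean(ability), mean(exam)
    slope = (sum((a - ma) * (e - me) for a, e in zip(ability, exam))
             / sum((a - ma) ** 2 for a in ability))
    intercept = me - slope * ma
    # Residual = observed exam mean minus the mean predicted from ability.
    return [e - (intercept + slope * a) for a, e in zip(ability, exam)]


# Hypothetical section means (invented for illustration only).
ability = [480.0, 510.0, 530.0, 550.0, 600.0, 620.0]  # aptitude test means
exam = [61.0, 66.0, 64.0, 70.0, 75.0, 74.0]           # final exam means
rating = [3.1, 3.8, 3.0, 3.9, 4.2, 3.6]               # mean instructor ratings

residuals = residual_scores(ability, exam)
validity = pearson_r(rating, residuals)  # rating vs. ability-adjusted exam
```

By construction the residuals are uncorrelated with the ability measure, so
any remaining correlation with ratings cannot be attributed to the measured
ability differences among sections. Unmeasured differences, such as
motivation, are not removed.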

One of the most controversial investigations cited in the literature is
a study by Rodin and Rodin (1972). The study is cited first in this section
because it has had so much visibility; the results have provoked much
discussion and some of the studies cited later were conducted as a reaction
to the Rodin and Rodin findings. Rodin and Rodin reported a strong
negative correlation between achievement and instructor ratings. Rodin
and Rodin used teaching assistants in an undergraduate calculus course.
The students met three days a week for a lecture with a professor, and on
the remaining two days met with individual teaching assistants in 12 small
sections. The teacher rating form used in the study was not specified;
moreover, only the responses to one question on the form were used in
the analysis. The question was, "What grade would you assign to his total
teaching performance?" Numbers were assigned to these ratings (A = 4
to F = 0). A measure of the students' initial ability in calculus was
obtained from the previous quarter. Mean grades in the course for the 12
sections and mean section ratings were used in the calculation of a partial
correlation. This partial correlation between the objective measure (the
grade determined by the number of problems passed) and the subjective
measure of teaching ability (the one question on the student evaluation),
with initial ability held constant, was -.75. "The instructors with the
three lowest subjective scores received the three highest objective scores.
The instructor with the highest subjective rating was lowest on the
objective measure" (p. 1165). The researchers concluded, "Students rate
most highly instructors from whom they learn least" (p. 1164).
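
For readers unfamiliar with the statistic, a first-order partial correlation
of the kind described above can be computed from the three pairwise
correlations. The sketch below uses the standard textbook formula; the
function name partial_r and the input values are hypothetical, since the
pairwise correlations are not given in this review.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical inputs: x = mean section rating, y = mean section grade,
# z = mean initial ability. These are NOT the Rodin and Rodin values.
example = partial_r(r_xy=-0.50, r_xz=0.30, r_yz=0.40)
```

When the control variable is unrelated to both measures, the partial
correlation equals the simple correlation; the further the two control
correlations are from zero, the more the partial can differ from it.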

Many researchers (Bryson 1974; Frey 1973a, 1973b, 1978; Gessner 1973;
Marsh, Fleiner, and Thomas 1975; Rippey 1975) have cited methodological
problems in the Rodin and Rodin study. One problem is that it is
"inconsistent with common sense as well as with accumulated results of
previous research on this topic" (Frey 1973a, p. 4). Another weakness is
that the research had assessed the effectiveness of graduate teaching
assistants (TAs) who had only complemented the activities of the professor
(Frey 1973a). It should be noted specifically that (in contrast to a great
many other studies where the TAs were, indeed, the instructors for the
courses) these TAs, though designated as instructors, really were only
assistants who had a minor role in instruction.

Frey (1973b) investigated the conclusions of Rodin and Rodin in his
study examining two different calculus courses that had a regular faculty
member and teaching assistants. The students met with the faculty member
three times a week for lectures and with a teaching assistant once a
week for a quiz. Each course had a common syllabus and a common final
examination. Approximately 75 percent, or 354, of the students completed
an instructional rating form used at Northwestern University. The form
was mailed to the students who had completed the examination and whose
Scholastic Aptitude Test (SAT) scores were on file at the university. The
average final examination score for each instructor (adjusted for initial
difference in sections using composite SAT scores) was the criterion for
validation of the student rating. One special strength of this study
concerns the reliability of the grading system. All examination papers were
scored in a common session with an instructor grading the same item for all
sections.

Frey factor analyzed instructor ratings using individual responses from
these and other classes and found six factors, indicating that the
evaluation form was measuring six different areas. A Pearson product-moment
correlation was calculated between the adjusted final examination score and
each of the six factors. In the introductory calculus course (eight
instructors), three factors, "teacher's presentation," "organization-
planning," and "student accomplishment," showed high positive correlations
with the regressed final examination scores: .91, .87, and .84,
respectively. In the multidimensional calculus course (five instructors)
the correlation between "student accomplishment" and the examination was
.90. When the correlations for the two calculus courses were averaged, the
"student accomplishment" factor and the "teacher's presentation" factor
were the highest predictors of achievement (.87 and .75). "Teacher
accessibility" and "work load" correlated the lowest (.31 and .44).

A weakness in Frey's study was the small number of sections involved
in the analyses. Frey admitted "correlation coefficients based on such a
small number of observations are notoriously unstable" (p. 84). The use
of factors, obtained by analyzing individual responses, in prediction of
class mean achievement scores is also open to question.
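
The instability that comes with so few sections can be made concrete with
an approximate confidence interval. The sketch below applies the Fisher z
transformation to the .91 correlation reported for the eight-instructor
course. It assumes bivariate normality and treats the coefficient as an
ordinary Pearson correlation, which is a simplification; the function name
fisher_ci and the 80-section comparison are hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95 percent confidence interval for a correlation."""
    z = math.atanh(r)                 # Fisher z transformation of r
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r = .91 based on eight sections, as reported for the introductory course.
low8, high8 = fisher_ci(0.91, 8)
# The same coefficient if it had been based on a hypothetical 80 sections.
low80, high80 = fisher_ci(0.91, 80)
```

Under these assumptions the interval for eight sections runs from roughly
.57 to .98, far wider than the interval for the hypothetical 80 sections.
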

Like Frey (1973b), Doyle and Whitely (1974) used examination scores
in conjunction with student ratings of college instructors. The premeasure
of ability of 174 beginning French students taught by 12 graduate students
at the University of Minnesota was the Minnesota Scholastic Aptitude
Test. The Student Opinion Survey (SOS), with an addition of seven general
items, was used in rating the instructors. Two types of data were included
in the study: between-sections data and across-sections data.
Between-sections data compared class trends and involved correlations of
section means. Across-sections data were from all sections, pooled, and
involved correlations of raw item responses. When the seven general items
were analyzed across sections (174 students), six of the items had
significant correlations with residual examination scores. The correlations
ranged from .18 to .25. However, when the same seven items were analyzed
between sections (12 instructors), only two of the items had significant
correlations (.51 and .49) with residual examination scores. These two
items related to "general teaching ability" and "overall teacher
effectiveness." Since the SOS had been factor analyzed using individual
responses, the correlations of the factors with residual achievement were
done only across sections. Two of the factors, "motivation of interest"
and "expository skills," correlated significantly (.36 and .31) with
residual achievement.
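
The difference between the two kinds of analysis is a difference in the
unit of analysis, which a short sketch can show. The records below are
invented; only the distinction between pooling individual responses and
correlating section means comes from the study as described above.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical records: (section, item rating, residual examination score).
records = [
    ("A", 4, 2.0), ("A", 5, -1.0), ("A", 3, 0.5),
    ("B", 2, -0.5), ("B", 3, 1.5), ("B", 2, -2.0),
    ("C", 5, 3.0), ("C", 4, 0.0), ("C", 4, 1.0),
    ("D", 3, -1.5), ("D", 2, -0.5), ("D", 3, -2.5),
]

# Across sections: pool every student and correlate the raw responses.
across_r = pearson_r([r for _, r, _ in records], [s for _, _, s in records])

# Between sections: average within each section, then correlate the means.
by_section = {}
for section, rating, score in records:
    by_section.setdefault(section, []).append((rating, score))
mean_ratings = [mean(r for r, _ in rows) for rows in by_section.values()]
mean_scores = [mean(s for _, s in rows) for rows in by_section.values()]
between_r = pearson_r(mean_ratings, mean_scores)
```

The across-sections coefficient rests on as many observations as there are
students, the between-sections coefficient on only as many as there are
sections, which is why the same item can reach significance in one analysis
and not in the other.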

The Doyle and Whitely study did not provide multiple correlations
using factors of SOS to predict residual achievement. Further, the
stability of the correlations in the across-section analyses is open to
question because of the small number of classes in the study. Also, the
seven general items that were added to the SOS must be questioned. No
information is given as to the origin of the items and the reasons for
their selection.

Another study using student ratings to predict residual achievement
was by Turner and Thompson (1974). Unlike the Frey (1973b) and the
Doyle and Whitely (1974) studies, in which small numbers of classes were
used, the Turner and Thompson investigation used one sample of 16
sections of beginning French students and another sample of 24 sections of
beginning French students, all taught by TAs. Residual achievement (final
examination corrected for first examination) was computed. Members of
the French Department selected 30 items from the student rating
instrument reported by Deshpande, Webb, and Marks (1970). Five items
specific to teaching beginning French were added to this list of 30 items.
These items related to the instructor giving students opportunities to
speak in French, having a good command of French, having a knowledge of the
culture of French-speaking peoples, making pronunciation errors in French,
and being enthusiastic about speaking French. Two subscales (labeled
"Instructor Cognitive and Affective Merit Versus Student Cognitive and
Affective Stress" and "Motivation and Workload") and a total subscale
score were then used as the student rating variables. When the two
subscales and the total subscale scores were used to predict residual
achievement, negative correlations of -.51, -.51, and -.52 were obtained
for the first sample and -.41, -.31, and -.41 for the second sample.

Of all the studies herein reviewed, this is only the second case in which
a significant negative relationship between ratings and achievement was
reported. The authors suggested that the "stress/overload" produced by
the instructor was the important factor in obtaining greater residual
achievement and that the positive behaviors of the instructor appeared
to lead to less residual gain. Since the vast majority of studies in this
area show opposing results to those of Turner and Thompson, their study
should be noted, but the findings should be viewed with caution. Turner and
Thompson concluded:



the results of the study suggest that student ratings of college instructors
should be treated with great caution by college administrators and
promotion and tenure committees. Although such ratings may express student
observations of and attitudes toward an instructor, they clearly cannot be
routinely interpreted to be positive indicators of student residual
achievement in the instructor's course. (p. 3)

Without a substantial number of other studies with similar negative
correlations, it is perhaps most useful to try to determine why this study
had such different results from the general body of the literature rather
than generalize from this study about the whole question of student
ratings. The article itself gives no basis for speculation as to why these
results were different. One reason might be because these instructors were
all TAs rather than full-time professors. Also, in the Turner and Thompson
study not enough information was given concerning the achievement
examinations. Although the authors stated that the first examination
covered grammar and the final examination covered grammar, dictation,
composition, and reading comprehension, they did not state whether the test
items were objective, essay, or a combination of the two. The type of items
on the test is an important consideration because the scoring of essay
items is generally not as reliable (consistent) among various instructors
as the scoring of objective items, and no information was reported about
this scoring.

In a fifth study, only a portion of the scores used in the analyses was
adjusted for ability. Frey, Leonard, and Beatty (1975) collected ratings of
instructors from 16 sections of introductory calculus at Northwestern
University, ten sections of educational psychology at Purdue, and five
sections of introductory calculus at North Dakota State. Each of the three
institutions used the Endeavor Instructional Rating Form. At each
institution instructors used a common syllabus, textbook, and final
examination. A factor analysis of the responses from Northwestern and
Purdue indicated similar factors. For three of these factors the
correlation with final examination performance was "fairly strong" at the
three institutions. The mean correlations for the factors and achievement
at the three schools were .59 for "student accomplishment," .58 for
"presentation clarity," and .51 for "organization-planning." It should be
noted that the best predictor of final examination performance found in any
comparison in the study was "organization-planning." (At Purdue this
correlation was .85.)
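
Squaring a correlation gives the proportion of variance the ratings and the
criterion share, the convention this report uses in its summary tables. As
a small worked illustration, the snippet below squares the coefficients
just quoted. The dictionary labels are ad hoc names chosen for this sketch.

```python
# Correlations quoted above from Frey, Leonard, and Beatty (1975).
reported = {
    "student accomplishment (mean)": 0.59,
    "presentation clarity (mean)": 0.58,
    "organization-planning (mean)": 0.51,
    "organization-planning (Purdue)": 0.85,
}

# Proportion of shared variance is the square of each correlation.
shared_variance = {label: round(r ** 2, 2) for label, r in reported.items()}
```

On this reading the three mean correlations correspond to roughly a quarter
to a third of the variance in section examination performance, and the
Purdue coefficient to a little under three quarters.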

For various reasons the correlational analysis was not based on all the
original sections. Four sections were eliminated from the Northwestern
data, and one section was eliminated from the Purdue data. The
researchers do not specify how many of the instructors were teaching
assistants. Mathematics SAT scores were used for the Northwestern analysis
to adjust final examination scores for the sections. No adjustment was made
for the other sets of data. An overall consideration indicates that the
study provides moderate support for the use of student ratings.

The data of the Frey, Leonard, and Beatty (1975) study constituted a
"qualitative improvement over that which was available in the Frey (1973b)
study" (Scott 1975, p. 445). Apparently this judgment is based upon the
increased number of course sections used in the study. In addition, the
findings of the study provide:

additional support for the contention that at least some information from
student ratings is positively related to student achievement, a trend which
must be substantiated if widespread use of student ratings for merit,
promotion, and/or instructional improvement is to be continued. (Scott 1975)

Another study concerning the relationship between regressed examination
scores and student ratings was by Frey (1976). Frey compared the
final examination performance of students in seven sections of
introductory calculus at Northwestern University to student ratings of the
instructors. Randomization was used in assigning subjects in each section
to two time-of-rating groups. The researcher compared the mathematics SAT
scores for the two groups. Some subjects were reassigned after this
comparison to ensure that the two groups were equal in mathematics
aptitude. When students signed up for the sections, they did not know which
instructor was to teach each section. Students who later requested section
changes were "actively discouraged."

Ratings of the instructors were conducted by a mail survey; half the
students rated the instructors during the final week of classes and the
other half during the first week of the subsequent term. Frey reported that
the two different times (before and after the examination) did not
significantly affect the ratings of the instructors, although the ratings
made after the examination showed a slightly stronger correlation. Results
of the study indicated a strong relationship between instructor ratings and
final examination scores, based on regressed mathematics SAT scores. The
highest correlation reported for the "before exam" group was .90 between
"planning" and the final examination. "Student accomplishment," "personal
attention," and "presentation skill" were the three best predictors
of final examination performance for the "after exam" rating group. The
correlations reported using these three aspects of instruction were .83,
.85, and .78 respectively, providing reasonably strong validation of
student ratings.

Instructors of the seven sections were full-time faculty members who
used a common text and a common syllabus. In one group 68 percent
returned the rating form, and in the other group 70 percent did so. Frey
reported similar mathematics SAT scores and similar final examination
scores for the responders and nonresponders. Frey reported that the
evaluation form used in the study, stressing student observation rather
than student opinion, was the result of a long development process. The
major criticism of the Frey study relates to the small number of course
sections.

Whitely and Doyle (1979) also investigated the relationship of student
ratings to achievement. The researchers compared the ratings of five
professors and 11 teaching assistants of a beginning mathematics course at
the University of Minnesota. When the data were calculated for between
classes, "overall teaching effectiveness" was significantly correlated with
the residualized final examination for the professors (.80), but it was not
significantly correlated with achievement for the teaching assistants. The
premeasure of ability was the Minnesota Scholastic Aptitude Test (MSAT).

As in previous studies, because of the small sample size, the data of
the Whitely and Doyle study must be interpreted cautiously. In the study
the Student Opinion Survey was the evaluation instrument, and the MSAT
was the ability measure used to residualize examination scores. Students
supplied identification numbers but were assured that the results would
be confidential. No information is given in the study about the
construction of the final mathematics examination. To ensure reliability in
grading, each teaching assistant graded one problem from all the students.
The report seems to indicate that the teaching assistants also scored the
papers from the professors' section, although this was not specified. Thus,
once again, there is some support for the use of student evaluations with
full-time professors, but not for their use with teaching assistants.

In one article, McKeachie, Lin, and Mann (1971) reported five studies
that pertained to criterion measures and student ratings of instruction.
In one study, scores were adjusted for intelligence, but the intelligence
test was not identified in the report. All correlations in the reported
studies were done using mean section scores on the student evaluation
instruments and class mean achievement scores; no multiple correlations
were reported.

In the first study, students in 33 (in the table they report 37) sections
of general psychology evaluated 17 instructors with the Isaacson et al.
(1964) evaluation instrument. Four factors of the instrument, "skill,"
"feedback," "interaction," and "rapport," correlated significantly (.28,
.35, .30, and .42, respectively) with the Introductory Psychology Criteria
Test, labeled a "thinking" test.

The study was then replicated with 34 sections of general psychology,
and results were analyzed separately for men and women. For a second
criterion, 25 items were taken from old examinations to make a
"knowledge" test. For males, "interaction" correlated significantly with
the "thinking" test (.33), and "overload" correlated significantly with the
"knowledge" test (.39). For females "feedback" correlated significantly
with both the criterion measures (.33 for the "thinking" test and .40 for
the "knowledge" test).

In the second study students in 32 sections of general psychology
evaluated 16 instructors. None of the factors was significantly correlated
with the "thinking" or the "knowledge" test for either females or males.

In the third study, only six instructors were involved, and the number
of sections was not reported. The criterion measures were a
multiple-choice test of knowledge and an essay test. "Skill" correlated
significantly with the essay test for females (.65). This correlation was
the only significant correlation in the study.

The sample of the fourth study consisted of 16 sections of second-year
French. Criterion measures of the study were a test of grammar, a test of
reading, and a departmentally administered test of oral expression. None
of the student rating factors correlated significantly with any of the three
French criterion measures for either females or males.

In the final study, 18 advanced graduate students, who were the
instructors, were evaluated by their students in sections of introductory
economics. The rating scales used in the study consisted of 12 items with
high loadings from the Isaacson et al. (1964) scale plus items previously
used in the economics course. The two criterion measures were a numerical
grade based on course examinations stressing "thinking" and an economics
attitude sophistication change score. For males, "structure" was
significantly negatively correlated with the grade (-.41). For females
"changes in beliefs" correlated significantly with the attitude
sophistication change score (.44) and "skill" correlated significantly with
the numerical grade and the attitude sophistication change score (.72 and
.43).

The five studies by McKeachie, Lin, and Mann (1971) illustrate a point
made earlier, namely, that when one uses different populations, different
examinations, and variations in the evaluation instrument (with different
factors), one can expect wide variations in the results. Criticism of these
five studies as reported by McKeachie et al. mainly has been concerned
with what was not reported. In three of the studies the variations of the
student evaluation instrument were not described clearly. In some of the
studies not enough information was given to determine the worth of the
measures of achievement. The researchers report that intelligence was
partialled out of the correlations of the first study, but no indication is
made of this adjustment in the other four studies. If no adjustments were
made concerning initial ability in the section, the results are open to
further question. Also, one of the studies is based on a sample of only six
sections. In two of the studies the authors specified that graduate
students taught the classes; no mention is made of the status of the
instructors in the other studies.

Two studies (Canaday, Mendelson, and Hardin 1978; Doyle and Crichton
1978) dealt with adjusted achievement scores and student evaluations,
although the main focus of these studies was on other research concerns.
The present discussion deals only with those dimensions of these studies
that have to do with student achievement as the criterion related to student
evaluation.

Canaday, Mendelson, and Hardin (1978) investigated the effect of timing
on the validity of student evaluation in a one-section course in anatomy.
They reported a significant relationship between the course
achievement, as measured by multiple-choice examinations, and the course
ratings of students in the College of Medicine, Medical University of South
Carolina. The researchers reported a partial correlation of .42 between
achievement and ratings, when GPAs were controlled. A 31-item student
evaluation instrument was designed for the study, and examination
reliabilities were reported to be .81 and .85. Because of attrition (some of
the ratings were collected three weeks after the final examination) and
other factors, the data of the study were based on only 93 of the original
158 students, but the study does lend moderate support to the use of
student ratings.

Doyle and Crichton (1978) investigated the relationship of student,
peer, and self evaluations to student achievement. They had usable data
from 263 student ratings of 12 instructors in a course in introductory
communications. Most of the instructors were graduate students. No
student, peer, or self-evaluations of instruction correlated significantly
with residual examination scores. Final examination scores were adjusted by
using verbal scores from the Preliminary Scholastic Aptitude Test. The
student evaluation instrument consisted of four items from factors
identified by Doyle and Whitely (1974) plus two overall evaluation items.
Thus, once again, ratings using mostly teaching assistants as instructors
were not related to achievement.

Finally, Benton and Scott (1976) did not calculate residual achievement
scores, but used self-reported grade point averages (GPAs) as one of the
independent variables in the calculation of a multiple correlation. Benton
and Scott selected two instruments, the Student Instructional Report (SIR)
by Centra (1972) and the Inventory of Student Perceptions of Instruction
(ISPI) by Scott (1973), that best exemplified the rational and empirical
approaches to developing student evaluation-of-instruction instruments.
These two instruments were administered at the University of Georgia in
31 sections of freshman English that had a common final examination. A
random half of each class was given the SIR and the other half, the ISPI.
Students were asked to supply their identification numbers and were
assured that the results would be confidential. Mean self-reported GPAs
and two empirical sections of SIR (labeled "adjustment of individual needs"
and "work load") were statistically significant predictors of class mean
examination performance. The multiple correlation obtained using the
self-reported GPAs and the sections of SIR as predictors was .62. There
was no empirical section or rational section of ISPI or combination of
sections with self-reported GPA that contributed significantly to the mean
final examination scores of the English classes. (The largest multiple R
obtained was .42.) The authors suggest that results of the study lend some
support to the use of instruments developed empirically over those
developed rationally.
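
A multiple correlation of the kind reported above combines several
predictors into one coefficient. The sketch below shows the standard
two-predictor formula only; Benton and Scott used more than two predictors,
and the input correlations here are invented, since the pairwise values are
not given in this review. The function name multiple_r is hypothetical.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion y with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    r_squared = ((r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12)
                 / (1 - r_12 ** 2))
    return math.sqrt(r_squared)


# Hypothetical inputs: predictor 1 = mean self-reported GPA, predictor 2 =
# one rating subscale, criterion = class mean examination score.
example = multiple_r(r_y1=0.45, r_y2=0.40, r_12=0.20)
```

The multiple R is never smaller than the larger of the two simple
correlations, so a sizable R can reflect mostly the ability measure rather
than the ratings.
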

There are certain problems inherent in the design of the Benton and
Scott study that may have influenced the lack of relationship between
student ratings and final examination scores. One problem involved the
lack of anonymity of the ratings. Students may have responded differently
if they had not been required to supply their identification numbers.
Another factor that may have influenced the lack of relationship was the
use of the common essay examination. Even though the researchers gave each
instructor a list of recommendations for scoring essay examinations, it
may be that the scores given by each instructor did not truly reflect
achievement in the course. Benton and Scott did compare the means of
self-reported GPAs with actual GPAs of a randomly selected portion of the
sample. The means were not significantly different, and the correlation
of actual GPAs and self-reported GPAs was .94, indicating that the use of
self-reported GPAs in research of this nature is a defensible procedure.

All in all, when achievement scores adjusted for ability are correlated
with student ratings, most studies have found a great deal of variability
but enough of a relationship to warrant the use of student evaluation
instruments for full-time professors. However, there is little support for
using these instruments to examine the instruction of teaching assistants.

Studies Incorporating Achievement Scores Not Adjusted for Ability

In all the previously mentioned studies, students were either randomly
assigned to course sections or the researchers incorporated some measure
of ability to adjust examination scores. Simply adjusting scores for ability
is not an entirely satisfactory substitute for student randomization. Course
sections can be significantly different in other variables, such as
motivation. If the students in one course section are more highly motivated
than students in another section, they may spend more time preparing
for the examination regardless of the deficiencies in the instruction. It is
further suggested that many:

researchers probably misuse ability pretests when residualizing
achievement and may remove from section-to-section achievement variation
the portion produced by differences in teaching ability in addition to the
portion produced by differences in student ability. (Leventhal, Perry, and
Abrami 1977, p. 361)


In spite of such criticism, the researchers discussed in the previous
section did make some attempt to compensate for differences in ability of
course sections. In the following studies (Orpen 1980; Bendig 1953; Cohen
and Berger 1970; Bryson 1974; Costin 1978; Hsu and White 1978; Blass iwj;
and Endo and Della-Piana 1976) apparently no attempt was made to adjust
achievement scores of course sections. Because of the possible initial
differences in the sections, the reported results should be interpreted
cautiously.

Even though no adjustment was made for possible differences in ability
of students in ten sections of an introductory course in mathematics, Orpen
(1980) did compare mean scores on the aptitude pretest, consisting of a
short form of the Scholastic Aptitude Test, and the mean grades the
students expected to obtain prior to the final examination. Results revealed
no significant difference among the sections on these two measures. The
students completed the Teaching Rating Form (derived from the form in
McKeachie, Lin, and Mann 1971). Each of the ten sections (taught by
different graduate students) used the same content, textbook, and
assignments. The common examination was scored by the course director and
three specially trained graduate students. Each section's average on the
final examination was correlated with the section subscale means on the
Teaching Rating Form. Six of the eight correlations, ranging from .^2 to
.74, were significant. A multiple correlation of .75 was calculated using
these six subscales together to predict the examination scores. Even though
the results of this study are somewhat equivocal, overall they support the
use of student ratings. This result is different from other similar studies
where teaching assistants were the instructors.

Even though sections were not significantly different on the aptitude
pretest or expected grades, there are other student characteristics that
could have made the sections different. Also, in this study the enrollment
for each section was between 10 and 12. This small enrollment is not
typical of the other studies that have used multisections of courses and
may be the reason why this study using TAs got basically positive
relationships whereas most other similar studies did not.
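
The section-level analysis described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the
ratings and examination means below are invented, not Orpen's data, and
the function names are hypothetical. The sketch correlates section means
rather than individual scores and converts r to a t statistic with n - 2
degrees of freedom, a conventional test that the original report may or
may not have used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of section means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical data: one mean rating and one mean exam score per section.
section_ratings = [3.1, 3.4, 3.6, 3.9, 4.0, 4.2, 3.3, 3.8, 4.1, 3.5]
section_exam_means = [61.0, 66.0, 64.0, 70.0, 73.0, 71.0, 65.0, 67.0, 74.0, 63.0]

r = pearson_r(section_ratings, section_exam_means)
t = t_from_r(r, len(section_ratings))
print(f"r = {r:.2f}, t({len(section_ratings) - 2}) = {t:.2f}")
```

With only ten sections the test has 8 degrees of freedom, so a correlation
must reach roughly .63 (two-tailed, .05 level) before it is significant.
This is one reason results based on few sections should be read cautiously.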

Bendig (1953) also investigated the relationship between course ratings
and achievement in an introductory psychology course at the University
of Pittsburgh. Three of the five instructors for this course were predoctoral
graduate students. The three achievement tests in the study were all
multiple-choice, were used in all the classes, and had been constructed on a
departmental basis. Bendig found that correlations between instructor
ratings and achievement varied greatly from section to section. Only one
of the five ratings correlated significantly with achievement of the students
(.37), and only one of the five section ratings correlated significantly with
achievement (.46). The total correlation of .28 for the five sections of
achievement and course rating was significant. However, the total
correlation of achievement and instructor rating was not significant.

The sum of each student's standard scores on the three achievement
tests was the criterion measure. Course ratings and instructor ratings were
determined by summing students' ratings on the Purdue Rating Scale for
Instruction. Students' rating forms were signed by the students, but they
were assured that the instructors would not see the individual forms and
that their grades would not be affected by their ratings. The small sample
of five instructors greatly limits the findings of the study. The equivocal
findings may have resulted from the use of both full-time professors and
TAs.

Cohen and Berger (1970) reported significant correlations between mean
final examination performance and three dimensions and the total scale
of the Michigan State University Student Instructional Rating Report
(SIRR). The three dimensions of SIRR that were significantly correlated
with achievement were "student interest" (.39), "student-faculty
interaction" (.37), and "course organization" (.31). The total scale correlated
.48 with achievement. None of the dimensions or the total scale correlated
significantly with mean class grade point averages at the onset of the
study.

The sample of the study consisted of 25 sections of a basic natural
science course at Michigan State University. The instructors had a course
syllabus designed by the staff. Each instructor was asked to administer
the SIRR "at his convenience" within a two-week period to one of his
sections. The researchers do not state whether the instructors were
professors or TAs. The final examination was a 100-item objective
examination that had been validated, and 93 percent of the students who took
the final examination completed the evaluation form.

Bryson (1974) also examined the relationship of student ratings and
achievement of students. Subjects were 582 students in 20 sections of
college algebra taught by 14 instructors who used a common syllabus and
textbook. The mean section scores on each of 14 items of a student rating
instrument were correlated with students' performance on the Cooperative
Intermediate Algebra Test. All ratings were anonymous, a strength many
such studies do not have. Six of the correlations were significant and
ranged from .44 to .68.

There are two apparent problems with the Bryson study. First, no
attempt was made to adjust for the initial ability in the classes. Second,
the evaluation instrument was not named. It was only stated that the
items "were selected from a routinely administered faculty and course
evaluation form" (p. 12). No information is reported about the validity or
reliability of the original instrument. If the original had acceptable
validity and reliability, the use of a portion of the items without
substantiation of such use may have reduced these values.

Costin (1978) reported significant correlations between mean ratings
of instructors of an introductory psychology course at the University of
Illinois and the mean final examination scores over a four-year time span.
The four correlations ranged from .41 to .56. The number of graduate
teaching assistants who were in charge of the classes ranged from 21 to
^ 3. Ratings of the instructors were anonymous. The percentage of students
rating the instructors ranged from a low of 76 percent to a high of 93
percent for the four years. The final examinations were constructed by
the supervisor of the course, and the instructors did not see the
examination until it was administered. Although the evaluation instrument
remained the same for the four-year period, the final examination in the
study was not the same over those years.

One of the criticisms of the Costin study concerns the instrument used
for the evaluation of the instructors. Five items were selected from a
46-item instrument reported by Isaacson et al. (1964). Even if the original
instrument reported by Isaacson et al. possessed adequate validity and
reliability, the use of only five of the 46 items raises serious questions
about the reliability and validity of the "new" instrument. No indication
of any recheck of reliability was reported. If all 46 items did indeed
represent content validity, then the reduction of the instrument to five items
probably reduced the content validity considerably. In contrast to a
number of other studies, this study does lend some support to the use of
student ratings of TAs.

Hsu and White (1978) found significant correlations between
achievement scores and students' ratings of instructors on two different
evaluation forms. The overall correlations, relating scores with the factors
of the instruments, were J4 and .6^ for the two instruments. The sample
consisted of 308 students enrolled in 12 undergraduate education courses from
West Chester State College in Pennsylvania. The instructors of the courses
were six full-time professors. The two evaluation instruments were the
Inventory of Student Perceptions of Instruction (ISPI) by Scott (1973) and
the Instructional Improvement Questionnaire (IIQ) by Pohlmann (1972).
In the study, the same graduate assistant used standardized instructions
to administer all the student evaluations.



The ISPI was administered halfway through the semester, and the IIQ
was administered toward the end of the semester. It is possible that results
would have been different if the instruments had been administered closer
in time. Certainly an evaluation of an instructor can change from the
middle of the semester to the end.

Another question arises in the Hsu and White study concerning the
achievement measures. Hsu and White state that the first two scores were
students' scores on the mid-term examinations and the third was the final
examination score. Ordinarily an instructor does not give two mid-terms,
so it is not clear how the three measures were obtained. No other
information is given regarding the examinations. Also, it is not stated whether
the same examinations were given for all the courses. If the courses were
really different and common examinations were not used, then the
analyses in the study should be questioned. Overall, however, the study
provides support for the use of student ratings of professors.

Blass (1974) investigated the relationship between mid-term grades
and course evaluation of students who were classified as "subjective" and
"objective." The sample of the study consisted of 48 nursing students in
an introductory psychology class at Brooklyn College, Brooklyn, New York.
When mid-term examination scores were correlated with each of nine
student evaluation-of-instruction items for all 48 students, six of nine
correlations were significant. The range of significant correlations was
from .34 to .60 for the total group. Also in the study, this positive
relationship between grades and teacher evaluations was true for students
with low scores on the Blass Objectivity-Subjectivity Scale (classified as
"subjective"), but was not true for students with high scores (classified
as "objective"). The largest correlation reported between examination
scores and any single evaluation item for the "subjective" students was
.73. The largest correlation reported between examination scores and any
single evaluation item for "objective" students was .44. In the study
students were asked to indicate their mark on the mid-term examination they
had taken two weeks previously.

Endo and Della-Piana (1976) found no significant correlations between
student ratings and common final examinations for eight combined
sections (n = 111) of trigonometry at the University of Utah. Apparently there
were five instructors for the eight sections. No description of the rank or
experience of the instructors was given. Correlations between student
ratings and achievement were also calculated for each instructor, but there
were no consistent trends across the instructors. The highest correlation
between any item and achievement for any instructor was .76. Over
one-half the initial enrollment was not included in the results because students
either did not turn in course evaluation cards or withdrew from class. The
researchers admitted that this fact is a "serious sample attenuation which
somewhat limits generalizability of results" (p. 84). The evaluation form
used in the study consisted of seven items to be rated on a seven-point
scale. The authors stated that the validity and the reliability of the form
are questionable; no reliability or validity data were reported.

In summary, the studies using achievement scores not adjusted for
ability compared to student ratings showed more variation than studies
using other types of comparisons. However, in general they tended to
support the use of student ratings of professors. Student ratings of TAs
were highly varied across the various studies, perhaps too varied to merit
their use.

Studies Incorporating Achievement Scores and Sections Selected
Without Identity of the Instructor
In addition to the Frey (1976) study mentioned earlier, three other studies
have been reported in which students selected their sections without
knowing who their instructors were to be. Although this procedure is defensible,
it is not as rigorous in research design as random assignment would have
been. Certainly the procedure is better than simply ignoring the fact that
differences between the sections might exist at the onset of a study. In
two of the studies, the researchers stated that a pretest indicated no
statistical differences in the initial ability of the students in the sections.
However, it is possible that the sections were different in other critical
areas than those evaluated by the pretest.

The Marsh, Fleiner, and Thomas (1975) study involved 18 sections of
an introductory course in computer programming at the University of
California at Los Angeles. Students chose sections on the basis of the time
the sections met without any knowledge of who would teach each section.
A 46-item evaluation-of-instruction instrument developed at the
University of California was used. The section averages of 12 of the 46 items were
significantly correlated with the average of the student examination scores
for the sections. A multiple correlation of .74, using four of the 12
significant items as predictors of average section achievement, was also
significant. In addition, two factors of the instrument, "course organization"
and "class presentations," as well as two summary items correlated
significantly with achievement. These correlations were .55, .43, .44, and .42,
respectively.

In the Marsh, Fleiner, and Thomas study only 72 percent of the students
completed the evaluation forms. Also, students in the study were asked
to include their registration numbers on the evaluation forms. Even though
the students were assured their evaluations would be anonymous, it is
possible that the results would have been more valid if students had not
been asked to include their registration numbers. A random spot check
indicated no variations in the scoring of the objective final examination.
The sections of classes in the study were generally taught by graduate
students who used a common course outline developed by the director of
the course. A major strength of the study is that the instructors had been
randomly assigned to the sections.

A second study in which students selected their sections without
knowing who the instructor was to be was the Marsh and Overall (1980) study.
The subjects, again, were students enrolled in 31 sections of a course in
computer programming application at the University of California at Los
Angeles. Instructors were, again, mostly graduate teaching assistants who
were supervised by a course director who had developed the final
examination. There were no statistical differences on pretest measures of
ability and interest in the 31 sections. Results of the study were based on
the 73 percent of the enrolled students who completed the required
examinations and forms. As in the previous study, students were asked to
supply their registration numbers. The evaluation form consisted of 33
items intended to measure seven factors of teaching.

Partial correlations were calculated between ratings given by students
at mid-term and at the end of the term and criteria of effective teaching.
Ratings given at the end of the term correlated higher with the criteria
than did the ratings given at mid-term. Regarding the end-of-term
evaluations, the final examination correlated highest with an "instructional
improvement" item (.42), "overall instructor" item (.38), and the factor
labeled "instructor enthusiasm/concern" (.40). When the results of this
study were compared with the Marsh, Fleiner, and Thomas (1975) study,
Marsh and Overall (1980) stated:

Both studies reported that achievement was significantly related to overall
instructor and instructional improvement summary ratings but was not
significantly correlated with overall course ratings. The two studies did
not, however, agree on which specific components of the student ratings
were most highly correlated with final examination performance. In
particular, the Organization factor that was most highly correlated with final
examination performance in the earlier study was not significantly
correlated with any of the criteria in this study. (p. 474)

The inconsistent results of the two studies are especially interesting since
the samples, courses, examinations, and the procedures were the same or
similar.

In a third study, one conducted by Braskamp, Caulley, and Costin
(1979), instructors during two subsequent semesters were assigned to
sections after students had registered. There is no indication that the
researchers checked for, or controlled for, any possible initial differences
in the sections. Instructors in the study were teaching assistants of a
psychology course at a "large midwestern university." None of the three
global items or the five scales of a student rating form significantly
correlated with student performance on a final examination for the fall
semester group. For the spring semester group, only one of the scales, labeled
"teacher control," was significantly correlated with achievement (.58).

In the study, 80 and 79 percent of students completed the evaluation
form for the two semesters. The researchers reported Kuder-Richardson
(KR 21) reliabilities of .83 and .86 for the multiple-choice final
examination. The researchers reported that 23 instructors taught 47 sections
one semester, 19 instructors taught 38 sections the other semester, and 17
of these instructors taught the course both semesters. Means in the study
were calculated by averaging the students' scores in all sections taught
by each instructor. However, in one table the means reported were based
on 19 and 17 instructors for the two semesters. The researchers did not
explain how these 19 and 17 were chosen, but apparently data for four
instructors one semester and two instructors the other semester were not
included in the analyses.
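
For readers unfamiliar with the KR 21 reliabilities mentioned above, the
standard Kuder-Richardson formula 21 can be sketched as follows. The
scores and the 50-item test length are hypothetical, the function name is
illustrative, and the use of the population variance is an assumption
(textbooks differ on this point). Nothing here reproduces the Braskamp,
Caulley, and Costin data.

```python
from statistics import mean, pvariance

def kr21(total_scores, n_items):
    """Kuder-Richardson formula 21 computed from total test scores.

    Assumes dichotomously scored items of roughly equal difficulty and
    uses the population variance of the total scores (an assumption).
    """
    k = n_items
    m = mean(total_scores)
    var = pvariance(total_scores)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

# Hypothetical total scores on a 50-item multiple-choice examination.
scores = [22, 27, 31, 35, 38, 40, 41, 44, 46, 48]
print(round(kr21(scores, 50), 2))
```

KR 21 needs only the number of items, the mean, and the variance of the
total scores, which is why it is often reported when item-level data are
not available.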

Generally speaking, the studies in which initial differences between
sections are somewhat controlled for by having students select their sections
without knowing who the instructor is to be have not shown a very
consistent relationship between student ratings of their instructors and
student achievement. It should be noted, however, that these instructors were
TAs rather than full-time, experienced professors.

A Meta-Analysis of Student Ratings and Student Achievement
One of the most recent as well as most important studies concerning
student ratings and student achievement was a meta-analysis by Cohen
(1981). Meta-analysis has been defined as an "analysis of analyses" or "the
statistical analysis of a large collection of analysis results from individual
studies for the purpose of integrating the findings" (Glass 1976, p. 3). Cohen
integrated and reanalyzed primary data analyses from some 41
independent validity studies that had incorporated 68 multisections of course
ratings in the prediction of student achievement.

The average correlation reported in the studies between student
achievement and an overall course rating (available in 22 of the 68
multisection courses) was .47, and the average correlation between student
achievement and an overall instructor rating (available in 67 of the 68
courses) was .43. Cohen reported that if no relationship existed between
student achievement and overall course ratings or between student
achievement and an overall instructor rating, then an equal number of
positive and negative correlations would be expected, with the majority
of the correlations around zero. However, the majority of the courses
reported positive relationships. "Instructors whose students achieved the
most were also the ones who tended to receive the highest instructor
ratings" (Cohen 1981, p. 296).
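
One common way to average correlations across studies, and to check
whether positive coefficients outnumber negative ones, is sketched below.
The passage does not say exactly how Cohen combined the coefficients, so
the Fisher z transformation used here is an assumption, the function name
is hypothetical, and the coefficients are invented for illustration.

```python
import math

def mean_correlation(rs):
    """Average correlations through Fisher's z transformation."""
    zs = [math.atanh(r) for r in rs]      # r -> z
    return math.tanh(sum(zs) / len(zs))   # mean z -> r

# Hypothetical per-course validity coefficients, not Cohen's data.
course_rs = [0.62, 0.35, -0.10, 0.51, 0.48, 0.70, 0.22]

average_r = mean_correlation(course_rs)
share_positive = sum(r > 0 for r in course_rs) / len(course_rs)
print(f"average r = {average_r:.2f}, share positive = {share_positive:.2f}")
```

If ratings and achievement were unrelated, the share of positive
coefficients would be expected to hover near one-half and the average
near zero.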

Cohen also reported the average correlations between achievement and
seven specific teaching dimensions. None of the 41 studies had reported
all these seven correlations. The average correlations between
achievement and the teaching dimensions were: skill (.50), structure (.47),
feedback (.31), rapport (.31), evaluation (.23), interaction (.22), and course
difficulty (.02). The average correlation for student progress, students'
self-ratings of their learning, and achievement (reported in 11 of the
studies) was .47. Cohen concluded:

While large effect sizes are found for the Skill and Structure dimensions,
other dimensions such as Rapport, Interaction, Feedback, and Evaluation
show more modest effects. The Course Difficulty dimension shows no
relationship with student achievement. Finally, students' self-ratings of
their learning correlate quite highly with student achievement. (p. 295)

Cohen says that his meta-analysis provides strong support for student
evaluation-of-instruction instruments as a measure of teaching
effectiveness when the effectiveness is defined as achievement in the course. The
data also seem to indicate that in using a student evaluation-of-instruction
instrument the greatest emphasis of teaching effectiveness should be placed
on an overall course rating item, an overall instructor rating item, or on
factors that measure skill, structure, or student progress. Emphasis should
not be placed on rating factors that relate to course difficulty.

Relationship of Student Evaluation
and Other Criterion Measures

All the previously mentioned researchers have used scores on course
examinations in establishing criterion validity. Other researchers have
employed other criterion measures in conjunction with achievement. These
have included students' gains in the course (Morsh, Burgess, and Smith
1956), students' scores on a national examination (Gessner 1973), scores
on a problem-solving exercise (Wiviott and Pollard 1974), and students'
interest in advanced courses and attitude toward the course subject
(McKeachie, Lin, and Mendelson 1978).

Other researchers have used criterion measures that did not include
student achievement. Among these are students' interest in advanced courses
(McKeachie and Solomon 1958), judges' ratings of videotape clips of
instructors (Stallings and Spencer cited in Aleamoni and Spencer 1973),
and the use of ratings of trained observers (McKeachie and Lin 1978).

Studies Using Criterion Measures in Conjunction with Achievement
One criterion of teaching effectiveness could be gains that students make
in a course. Morsh, Burgess, and Smith (1956) correlated student gains
on a test of knowledge with instructor ratings. They also used gains on a
performance examination. The gains made on the written examination,
the gains made on the performance test, and the combined gains correlated
significantly with the overall ratings of the instructors (.32, .39, and .40,
respectively). When only student ratings of the instructors' teaching ability
were correlated with the three gains criteria, the correlations were slightly
higher (.41, .41, and .46, respectively).

In the study, complete data were available on 106 of 121 instructors
of a hydraulics phase of an aircraft mechanics course at Sheppard Air
Force Base. Classes consisted of about 14 students each, and this phase of
the course lasted only eight days. One possible confounding variable in
most of the reported studies is that the instructors who were being rated
administered the criterion tests. The way the students felt about this
examiner conceivably could have affected their performance on the criterion
test. In contrast, a strength of the Morsh, Burgess, and Smith (1956) study
is that the criterion tests were administered by personnel other than the
instructors of the classes.

Other variables that were compared with gain scores were peer
rankings and supervisor ratings and rankings, verbal facility ratings,
instructors' knowledge of hydraulics, and instructors' general intelligence. Morsh,
Burgess, and Smith (1956, p. 86) concluded that "student ratings of their
instructors were the only instructor measures which seemed to predict
the student gains criterion." Although the study involved instructors at
an Air Force base, the researchers suggest that the results "would find
application to other teaching situations" (p. 87).

"A confounding factor that has not been sufficiently recognized is that
in many instances the persons who developed the measure of achievement
and the persons who were rated by the students were presumably the
same individuals" (Costin 1978, p. 86). Gessner (1973) recognized this
problem and used not only departmental examinations but also a
nationally normed examination that the instructors in the study had no part in
developing.

Students in the Gessner study were 119 second-year medical students
in a basic general science course. Ten faculty members taught the 23
subject areas in the course. Students attending the last lecture of the course
evaluated the subject areas with regard to "content and organization"
and "presentation." A three-point scale was used: good, fair, or poor, and
ratings were assigned values of +1, 0, and -1. A weighted mean rating
for each subject area was calculated. A departmental committee prepared
departmental examinations from questions submitted by the individual
faculty members. On this examination, performance for an area was
determined as the mean class performance for that area. Five weeks after
the course was completed, 116 students also took Part I of the National
Medical Board Examination. Questions from this examination were
classified into the subject areas by two members of the faculty. The difference
between the percentage of the class and the nationwide sample who
answered each question correctly was calculated for each item. These units
of differences were averaged for each subject area and were used as the
measure of class performance in the subject areas of the national
examination. It is not clear why Gessner chose to use these units of difference.
The use of the class and the national group in the study has been criticized,
and one writer states that the design of the Gessner study lacked internal
validity (Leventhal 1975).
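
The two summary measures described above can be illustrated with a small
Python sketch. It reflects one plausible reading of the passage: the exact
weighting Gessner used is not given here, the function names are
hypothetical, and all counts and percentages are invented.

```python
def weighted_mean_rating(n_good, n_fair, n_poor):
    """Mean rating when good, fair, and poor are scored +1, 0, and -1."""
    total = n_good + n_fair + n_poor
    return (n_good * 1 + n_fair * 0 + n_poor * -1) / total

def mean_difference(class_pct, national_pct):
    """Average per-item difference: class percent correct minus national."""
    diffs = [c - n for c, n in zip(class_pct, national_pct)]
    return sum(diffs) / len(diffs)

# Hypothetical subject area: 50 good, 20 fair, and 8 poor ratings.
print(round(weighted_mean_rating(50, 20, 8), 2))

# Hypothetical items for the same subject area (percent answering correctly).
print(mean_difference([80.0, 70.0, 65.0], [75.0, 72.0, 60.0]))
```

Under this reading each subject area receives one rating value between -1
and +1 and one performance value in percentage points, and the
correlations are then computed across subject areas.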

The significant correlation between class performance in the subject
areas on the national examination and ratings on "content and
organization" was .7?, and for the subject areas of the national examination and
"presentation" the correlation was .69. However, when partial correlation
coefficients were calculated for these variables, with "relative emphasis"
(the amount of time devoted to a topic) held constant, the correlations
dropped slightly to .74 and .62. The correlations between class
performance in the subject areas on the departmental examination and the two
rating dimensions were only .11 and .17, respectively. Gessner concluded
that:

It appears quite clear that student ratings of instruction and class
performance on national examinations are positively related: the higher the
student ratings of the instruction they receive, the higher the class score
relative to a nationwide norm. On the other hand, no significant
correlation is found between student ratings and class performance on
institutional examinations. This suggests that both student ratings and class
performance on national normative examinations are valid measures of
teaching effectiveness. (p. 569)

A readily apparent problem in this study is the loss of 41 (of the 119)
students who did not attend the last lecture and, therefore, did not rate
the subject areas on the two dimensions. The results may well have been
different if the ratings of these 41 students had been included.

Too, one could question the two dimensions that were chosen for the
ratings. The author did not describe the rationale used for choosing those
particular dimensions. Although the students rated 23 subject areas, only
20 of these were used in the computation of correlations. The other three
areas are not accounted for. It is possible that these dimensions of the
course were not included in the National Medical Board Examination.
Certainly, since the two achievement measures were calculated in different
ways, there is some question in making comparisons of the correlations
involving them. Overall the study adds credibility to the use of student
ratings.
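
The partial correlations reported for the Gessner study hold "relative
emphasis" constant. The standard first-order partial correlation formula
can be sketched as follows. The input values are hypothetical, since the
passage does not report the correlations of relative emphasis with the
ratings or with performance, and the function name is illustrative.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical values: x = rating, y = class performance, z = relative emphasis.
print(round(partial_r(0.70, 0.30, 0.35), 2))
```

When the third variable is only weakly related to the other two, the
partial correlation stays close to the original coefficient, which is
consistent with the slight drops described above.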

In addition to using an achievement test as a criterion, Wiviott and
Pollard (1974) also used a problem-solving exercise to measure "ability
to analyze, synthesize, and evaluate course content" (p. 37). Results of the
research suggested that student ratings of instruction were not related to
scores on the achievement test and only contributed "slightly" to a
regression model for problem solving.

Again, as in previous studies, one of the criticisms of the Wiviott and
Pollard study is the small number of course sections. The sample consisted
of six introductory educational psychology sections at the University of
Wisconsin-Milwaukee. Whereas the researchers reported the sample was
composed of 138 undergraduates in the sections who had completed the
criterion measures and the course evaluation forms, one does not know
the percentage of students enrolled in the course who were eliminated
from the study.

Subjects in the study were assured that their scores on the criterion
measures would not affect their grade in the course and that their scores
would not be given to their instructors. Since anonymity was not provided
for, it is possible the scores were not true measures of the students' ratings
of their instructors.

It is not clear from the report whether the instructors of the sections
were teaching assistants. The researchers stated, however, that teaching
assistants had administered the student evaluation instrument. The
researchers administered the tasks. A model answer was used to score the
problem-solving exercise, and "inter-rate" reliability was reported to be
.78.

In addition to scores on an achievement examination, McKeachie, Lin,
and Mendelson (1978) also used interest in advanced psychology courses
and attitude toward psychology as criterion measures. Although most
researchers have measured achievement by a final examination given at
the end of the grading period, McKeachie, Lin, and Mendelson also looked
at delayed measures. They state:

Probably the oldest objection to student ratings is the comment, "I did
not really appreciate some of my best teachers until sometime after the
course had ended." Another common quote is, "Most of what a student
learns and puts on a final examination is forgotten by the next week." (p.
352)

McKeachie, Lin, and Mendelson (1978), therefore, compared students'
interest in advanced psychology courses, scores on the Introductory
Psychology Criteria Test, and scores on the Attitude Toward Psychology Scale
to student ratings of instruction at the end of an introductory psychology
course and again 14 months after completion of the course. The ratings
of instruction were not highly related to the criterion measures at either
of the two time periods. The measure of attitude toward psychology was
the only follow-up criterion that had a rank order correlation with student
ratings that the authors labeled as "substantial" (.66).
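
A rank order correlation of the kind mentioned above can be sketched with
the usual Spearman formula for untied ranks. The ranks below are invented
for six hypothetical instructors and do not reproduce the McKeachie, Lin,
and Mendelson data; the function name is illustrative.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rank order correlation, assuming no tied ranks."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ranks: instructors ordered by mean rating and by the
# follow-up attitude measure of their students.
rating_ranks = [1, 2, 3, 4, 5, 6]
attitude_ranks = [2, 1, 3, 6, 4, 5]
print(round(spearman_rho(rating_ranks, attitude_ranks), 2))
```

With only six cases the coefficient can take only a coarse set of values,
so a single swap of adjacent ranks changes it noticeably.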

The sample consisted of only six instructors at the University of
Michigan who were all advanced graduate students. The student rating
instrument consisted of items derived from the form reported by Isaacson
et al. (1964). In the study, the researchers were able to locate 124 of
the original 152 students enrolled in the six courses. Students were sent
letters with the questionnaire and were offered three dollars to complete
the questionnaire. Of the students located, 92 (74%) responded (61% of
the original sample). The researchers reported that the respondents did
not differ significantly from the nonrespondents on the measure that had
been completed at the end of the course. On the questionnaire, only ten
items of 48 that were used at the end of the course were included from
the Introductory Psychology Criteria Test. Other than the positive
relationship of attitude toward psychology and student ratings, the study
does not lend much support to student ratings. It should be noted that
the instructors were TAs rather than full-time professors.
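
The two response percentages quoted for the follow-up questionnaire use
different denominators, which is easy to overlook. The short sketch below
simply reproduces that arithmetic from the counts given in the text; the
function name `response_rate` is a hypothetical helper introduced here for
illustration and is not part of the original study.

```python
def response_rate(responded: int, base: int) -> float:
    """Return the percentage of `base` who responded."""
    return 100.0 * responded / base

# Counts reported for the McKeachie, Lin, and Mendelson (1978) follow-up.
enrolled = 152   # students originally enrolled in the six sections
located = 124    # students the researchers were able to locate
responded = 92   # students who returned the questionnaire

of_located = response_rate(responded, located)     # about 74 percent
of_original = response_rate(responded, enrolled)   # about 61 percent

print(f"Responded, of those located:  {of_located:.0f}%")
print(f"Responded, of those enrolled: {of_original:.0f}%")
```

The lower figure is the more conservative one when judging how well the
follow-up sample represents the original classes.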

Studies Using Criterion Measures Other Than Achievement 
Final examinations are the most commonly used criterion measures of
teacher effectiveness. Many instructors argue that some of their most
important objectives cannot be measured by final examination scores. The
ability to arouse interest in the subject matter should be one of the
criteria of effective teaching. It has been stated:

While awakened interest is not an educational outcome, we might expect
that when a teacher has aroused interest in his field, students will be
likely to elect another course in that field. Thus in comparing the
effectiveness of instructors in a multisection course, we might compare
the percentages of their students who elected advanced courses.
(McKeachie and Solomon 1958, p. 379)

McKeachie and Solomon then proceeded to validate instructor ratings
against the percentage of students who took advanced courses. Data were
collected over a period of about three years from students in about eight
advanced psychology courses. Students were asked to report the instructor
and semester they had taken the beginning psychology course. At the end
of some semesters students in the beginning psychology course responded
to two items that had to do with an overall rating of the instructor's
effectiveness and an overall rating of the course. Instructors were
ranked on the ratings on the two questions and on the percentage of
students taking advanced courses. In two of the five semesters, the
ratings of the instructors were significantly correlated with the
percentage of students electing to take advanced courses (.63 and .41).

Stallings and Spencer (cited in Aleamoni and Spencer 1973) employed
a different type of criterion measure. They compared student ratings of
nine instructors of accountancy at the University of Illinois with ten
judges' ratings of the instructors. The judges, who were measurement
specialists and teaching assistants in the speech department, rated video
tape clips of the instructors on a three-point scale. Students rated the
instructors on the Illinois Course Evaluation Questionnaire (CEQ). Total
scores on the CEQ for each instructor were averaged and ranked, and the
average instructor ratings by the judges were also ranked. A significant
Spearman rank order correlation of .70 was obtained between rank on the
CEQ and the average rating rank.
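
For readers unfamiliar with the statistic, the procedure just described
(rank the instructors on each measure, then correlate the two sets of
ranks) can be sketched as follows. This is a minimal illustration, not the
computation used by Stallings and Spencer: the function names and the five
pairs of scores are hypothetical, and tied scores are assumed to receive
the mean of the ranks they span.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank order correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: mean student rating and mean judge rating per instructor.
student_means = [10, 20, 30, 40, 50]
judge_means = [2, 1, 4, 3, 5]
print(round(spearman_rho(student_means, judge_means), 2))
```

Because only the ordering of instructors enters the calculation, the
statistic is unaffected by the different scales used by students and judges.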

Finally, McKeachie and Lin (1978) compared student ratings with ratings
given by trained observers. McKeachie and Lin used graduate student
observers trained in a categorization system to evaluate 20 teachers in
three introductory psychology courses at the University of Michigan. The
researchers report that such data are difficult to obtain because of the
costs in training observers. Since a factor of many student
evaluation-of-instruction instruments is "rapport," student ratings on
that factor of the Student Perceptions of Teaching and Learning were
correlated with observed teacher acts relating to "warmth" and
"agreement." The "rapport" factor consisted of three items that related
to the instructor being permissive, friendly, and inviting criticism. One
correlation between the students' ratings of the instructor being
friendly and the observed behavior of "agreement" was significant (.61).
McKeachie and Lin state that the "study lends some empirical support to
the presumption that student ratings of teaching are based on teacher
behavior" (p. 47).

Teachers in the study ranged in experience from zero to 27 years. The
graduate students observed each class approximately six times during the
term. Since the three introductory psychology courses were not described
in the study, it is perhaps possible that the nature of the different
courses could cause the same teacher to have quite different ratings on
"rapport," "warmth," or "agreement" for the different courses.

Taken together, comparisons of teacher ratings to criterion measures
other than achievement lend some criterion validity to the use of the
ratings. However, as in the use of achievement as a criterion, these
comparisons with other measures indicate there is a great deal more to
evaluating instruction than can be accounted for in this type of
comparison.

An overall examination of criterion validity studies of student
evaluations of college instruction suggests four major observations. The
first of these observations is that the majority of the investigations
cited in this monograph reported significant positive correlations
between student ratings of instruction and criterion measures held to be
measures of effective teaching. Therefore, there appears to be sufficient
criterion validity data to support the use of student evaluations of
college instruction. This synthesis of the findings of many studies
indicates that the student evaluations of instruction are tapping into an
important dimension of teaching. Thus, student evaluations can be a
defensible part of an instructor's evaluation and can contribute to the
improvement of teaching.

The second major observation that can be drawn from these studies is
that the relationship between student evaluations of instruction and
criterion measures is by no means perfect. Although the majority of the
investigations cited in this monograph reported a significant positive
correlation between ratings and criterion measures of effective teaching,
that correlation was almost always a modest one. Several researchers
reported nonsignificant correlations. Apparently a great deal more goes
into effective teaching than can be easily evaluated with student
evaluation-of-instruction instruments; therefore, these evaluations
should not be the sole vehicle for judging the effectiveness of
instruction.

Student evaluation instruments have been increasingly widely used in
making decisions about tenure, promotion, and merit pay of college
instructors. Obviously, considering the modest correlations and the
methodological problems in the literature cited in this monograph, these
ratings should not be the sole criterion. Aleamoni (1976) states that it
would be invalid to use student ratings as the only basis for decisions
about an instructor's effectiveness. He further adds:

It is important that instructional evaluation systems designed for
administrative personnel decisions include evaluations of colleagues,
course content, course materials, course objectives, instructor
self-ratings, quality of student learning, and so forth, in addition to
student ratings. (p. 609)



Even when student ratings are not used for personnel decisions and
are used only for the improvement of instruction, the instructor should
realize that the research indicates that student evaluations do not tell
the whole story. Additional sources of feedback would appear to be needed.

A third general observation is that there is a discernible trend in the
frequency with which certain student rating variables appear as leading
indicators of effective teaching. First, items relating to an overall
rating of the instructor or overall scores on the instrument were often
listed as significant predictors of teaching effectiveness. Cohen (1981)
also pointed out in his meta-analytic study that an overall course or an
overall instructor item correlated highly with student achievement. Such
an overall item or an overall evaluation score is of some use in
decisions about tenure, promotion, and merit pay, but it would not be
very useful in the improvement of instruction, one of the primary reasons
for student evaluation-of-instruction instruments. Teachers need much
more information relating to specific strengths and weaknesses if they
are to make any adjustments in their teaching.

Benton (1979) examined 19 studies that reported the names of factors
of student evaluation-of-instruction instruments. He found that the 113
named factors of the various studies could be classified into eight
categories. These categories, in order of the frequency of appearance,
were: skill of instructor, student-teacher interaction, course
organization and content, feedback to students, course difficulty and
workload, motivation, importance of the course, and attitude of
instructor. The overall examination of the studies reviewed in the
present monograph indicates that the first three categories listed by
Benton were often listed as significant predictors of some measure of
teaching effectiveness.

The category most often mentioned in the studies as a significant
predictor of teaching effectiveness related to the skill of the
instructor. This category appeared more than two times as often as the
second best predictor. Factors labeled "skill," "lectures,"
"presentations," "presentation clarity," "presentation skill,"
"expository skills," and "class presentations" were included in this
skill-of-the-instructor category.

The second category most often listed as correlating significantly with
achievement was organization and content. Factors labeled
"organization-planning," "planning," and "course organization" were
included in this category.

The third category most often found to significantly correlate with
instructor effectiveness was interaction and included factors labeled
"interaction" and "student-faculty interaction." Although factors
relating to the other five categories reported by Benton were sometimes
reported as significant predictors of effective teaching, the infrequency
of their appearance gives them much less credibility than the three
mentioned above.

It is interesting to note that in the meta-analysis reported by Cohen
(1981) skill and structure correlated "highly" with achievement, whereas
interaction correlated "moderately" with achievement. The present study
and the Cohen meta-analysis seem to indicate that factors relating to
skill of the instructor and to organization and planning or structure are
factors that correlate highest with teacher effectiveness. Therefore, in
selection of an instrument designed to measure teacher effectiveness, it
is recommended that an instrument should definitely possess these two
factors, and that they should carry more weight in faculty tenure,
promotion, and pay decisions than other factors of the instruments.
Further, instructors seeking to improve their teaching should attend most
carefully to these factors.

The fourth major observation drawn from the present analysis is that
a definite need still exists for more criterion validity studies of
student evaluation-of-instruction instruments. Future researchers in this
area need to give more attention to the methodological problems
previously cited. Specifically, of the recommendations listed by Benton
(1974), the following appear to have application for future studies
investigating the relationship of ratings and criterion measures:

1. Subjects should be randomly assigned to sections, and then
instructors also should be randomly assigned.

2. A large number of sections should be used.

3. Subject-matter content should be essentially the same across the
sections.

4. Examinations with better psychometric qualities should be used
(such as rationale for devising and revising items and validity and
reliability information).

5. In addition to achievement, other appropriate criterion measures
that cover the spectrum of instructor objectives should be used (e.g.,
attitude measures).

Also, more standardized procedures for administration and scoring of
instruments should be evidenced in future reports.

It would appear to be advisable to replicate studies that have already
been reported. If student rating forms with adequate reliability and
validity information are used, one would be justified in determining
whether the relationships reported in the reviewed studies could be
further generalized. In addition, further research that is most urgently
needed is: (1) the comparison of ratings of TAs and full-time professors,
(2) the effects of the time during the semester or quarter the forms are
administered on the ratings of the instructors, (3) criterion validity
studies that involve advanced classes, (4) criterion validity studies
that involve graduate classes, and (5) the comparison of rating forms
developed empirically and those developed rationally.

The table on pages 36-40 is a summary of the studies reported in this
monograph that examined the relationship of student ratings of
instruction and criterion measures. Although parallel data were not
reported in all the studies, the table shows the largest significant
correlation reported in each study.

These largest correlations are squared to indicate the proportion of
variance (common variance) shared by the two variables: the criterion
and the student ratings. Common variance has to do with the variation
in one variable that can be attributed to its tendency to vary with the
other. For example, if an obtained correlation of .50 is squared, the
resulting value is .25. This indicates that we know 25 percent of what we
need to know to make a perfect prediction of one variable (the criterion)
from the other (the result of the student rating).
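
The squaring step can be checked with a few lines of code. The sketch
below applies it to the worked example of .50 and to three correlations
taken from the summary table; the function name `shared_variance` is a
hypothetical helper introduced only for this illustration.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two variables share, given their correlation r."""
    return r * r

# The worked example from the text, then three correlations from the table.
examples = {
    "worked example": 0.50,
    "Centra (1977)": 0.96,
    "Rodin & Rodin (1972)": -0.75,
    "Costin (1978)": 0.31,
}

for label, r in examples.items():
    print(f"{label}: r = {r:+.2f}, r squared = {shared_variance(r):.2f}")
```

Squaring discards the sign, so a correlation of -.75 and one of .75
represent the same proportion of common variance (.56) even though they
point in opposite directions.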

Studies Examining Relationship of Student Ratings of Instruction and Criterion Measures

| Study | Sample | Student Evaluation Instrument | Largest Significant Correlation Reported | Largest Significant Correlation Squared |
|---|---|---|---|---|
| Bendig (1953) | 5 sections of intro. psychology | Purdue Scale for Instruction | .46 | .21 |
| Benton & Scott (1976) | 31 sections of freshman English | Student Instructional Report, Inventory of Student Perceptions of Instruction | .62 | .39 |
| Blass (1974) | 1 intro. psychology course | Course Rating Sheet | .73 | .53 |
| Braskamp, Caulley, & Costin (1979) | 19 and 17 instructors of psychology | 3 global items, items from Costin (1971), items from form described by Isaacson et al. (1964) | .58 | .34 |
| Bryson (1974) | 20 sections of college algebra, 14 instructors | 12 items from a "routinely administered faculty and course evaluation form" | .68 | .46 |
| Canaday, Mendelson, & Hardin (1978) | one-section anatomy course | total scores on a 31-item questionnaire (further described) | .42 | .18 |
| Centra (1977) | 72 sections of 7 different courses taught by 74 teachers (no TAs) | Student Instructional Report | .96 | .92 |
| Cohen & Berger (1970) | 25 sections of a basic natural science course | Michigan State University Student Instructional Rating Report | .48 | .23 |
| Costin (1978) | 96 TAs of intro. psychology over a 4-year period | 5 items from a form described by Isaacson et al. (1964) | .31 | |
| Doyle & Crichton (1978) | 12 instructors of communications | 4 items from factors identified by Doyle & Whitely (1974) + 2 summary items | | |
| Doyle & Whitely (1974) | 12 TAs teaching beginning French | Student Opinion Survey | .51 | .26 |
| Endo & Della-Piana (1976) | 8 sections of trigonometry, 5 instructors | Associated Students of the University of Utah Course Evaluation | .76 | .58 |
| Frey (1973b) | 8 instructors of intro. calculus & 5 instructors of multidimensional calculus | A Northwestern University instructional rating form | .91 | .83 |
| Frey (1976) | 7 sections of intro. calculus | not specified | .90 | .81 |
| Frey, Leonard, & Beatty (1975) | 12 and 5 sections of intro. calculus, 9 sections of ed. psychology | Endeavor Instructional Rating Form | .85 | .72 |
| Gessner (1973) | 10 faculty members teaching 23 subject areas of a basic science course | ratings of each of the subject areas regarding content, organization, and presentation | .77 | .59 |
| Hsu & White (1978) | 12 classes of undergraduate education courses, [number illegible] instructors | Inventory of Student Perceptions of Instruction, Instructional Improvement Questionnaire | .74 | .55 |
| Marsh, Fleiner, & Thomas (1975) | 18 sections of intro. computer programming | 46-item instrument developed at the Univ. of California, Los Angeles | .74 | .55 |
| Marsh & Overall (1980) | 31 sections of intro. to computer programming applications | 33 items from 7 factors and 3 summary items | .42 | .18 |
| McKeachie & Lin (1978) | 20 instructors of 3 intro. psychology courses | "rapport" factor of Student Perceptions of Teaching and Learning | .61 | .37 |
| McKeachie, Lin, & Mann (1971), study (1) a | 33 sections of general psychology | variations of the Isaacson et al. (1964) scale | .42 | .18 |
| McKeachie, Lin, & Mann (1971), study (1) b | 34 sections of general psychology | variations of the Isaacson et al. (1964) scale | .40 | .16 |
| McKeachie, Lin, & Mann (1971), study (2) | 32 sections of general psychology | variations of the Isaacson et al. (1964) scale | | |
| McKeachie, Lin, & Mann (1971), study (3) | 6 instructors, no. of sections not specified | variations of the Isaacson et al. (1964) scale | .65 | .42 |
| McKeachie, Lin, & Mann (1971), study (4) | 16 sections of 2nd yr. French | variations of the Isaacson et al. (1964) scale | | |
| McKeachie, Lin, & Mann (1971), study (5) | Intro. economics taught by 18 TAs | variations of the Isaacson et al. (1964) scale | .72 | .52 |
| McKeachie, Lin, & Mendelson (1978) | 6 sections of intro. psychology, 6 TAs | items derived from form described by Isaacson et al. (1964) | .66 | .44 |
| McKeachie & Solomon (1958) | approximately 8 advanced undergrad. psychology courses over a 3-year period | 2 global items | .63 | .40 |
| Morsh, Burgess, & Smith (1956) | 106 instructors of a hydraulics phase of an aircraft mechanics course | global item + 4 ratings on qualities of the instructors | .46 | .21 |
| Orpen (1980) | 10 sections of mathematics, 10 TAs | Teaching Rating Form, a version of the form of McKeachie et al. (1971) | .75 | |
| Rodin & Rodin (1972) | "teaching assistants" of 12 sections of undergraduate calculus | one global item | -.75 | .56 |
| Stallings & Spencer (cited in Aleamoni & Spencer 1973) | 9 instructors of beginning accounting | Illinois Course Evaluation Questionnaire | .70 | .49 |
| Sullivan & Skanes (1974) | 130 sections of 10 different courses | researcher-designed form | .53 | .28 |
| Turner & Thompson (1974) | 16 and 24 sections of beginning French taught by TAs | 30 items from scale of Deshpande, Webb, & Marks (1970) + 5 additional items | -.52 | .27 |
| Whitely & Doyle (1979) | 5 professors, 11 TAs of beginning mathematics | Student Opinion Survey | .80 | .64 |
| Wiviott & Pollard (1974) | 6 sections of intro. ed. psychology | 25-item rating scale and the grade A-F assigned to the course | | |

The range of the significant correlations (-.75 to .96) indicates that
the findings are highly inconsistent. Further, when there was a
significant relationship between the ratings and the criterion, the
amount of variance accounted for was usually not large. Examination of
the table suggests several possible reasons for the inconsistency of the
findings. One of the most obvious possibilities is that many of the
studies were based on small sample sizes. Very few of the studies had
large enough sample sizes to merit confidence in the stability of the
results. Approximately one-third of the studies were based on samples of
fewer than ten sections.

A second possibility results from the diversity of the types of courses
that used the evaluation forms. Among the subject areas reported in the
studies were psychology, English, mathematics, science, communications,
French, government, and computer programming. Most of the courses
were beginning courses; a few were advanced. It is entirely possible that
one subject area may require a different type of teaching than another,
and that the type of teaching in advanced courses is very different from
the teaching in beginning courses. Perhaps in the advanced courses the
number of students in various sections would be smaller than the number
of students in beginning courses, and this difference could affect the
ratings. None of the studies concerned graduate classes, yet many colleges
and universities are presently using evaluation forms in graduate classes.

Another possible source of the inconsistency is the number of types of
evaluation forms used in the studies. Indeed, it was rare to find two
studies that used the same form. Many of the researchers were so vague in
describing the instruments they used that it would be impossible to
replicate their studies. Other researchers used forms that lacked a
rationale for devising items, lacked provisions for revising items, and
had no reliability and validity information. Some researchers used a
portion of items from other instruments but offered no reliability or
validity information for those items. Marsh and Overall (1980) suggest
that even if different evaluation instruments that had similar factor
labels were used, there is no guarantee that the factors are indeed the
same. Rating forms have been developed in several different ways. Benton
(1979) reports there are two approaches to developing items to be
included in the final form of instruments: a rational approach and an
empirical approach. The review of the literature does not indicate which
of the two types of instruments would be the better predictor of
criterion measures. In many cases one does not know, when reviewing the
research, whether an instrument was developed by one of the two
approaches or whether the items on the instrument simply have face
validity.

Another possibility is that many of the studies did not distinguish
who was being evaluated, TAs or full-time professors. Although
apparently the majority of these studies used TAs, the findings have been
overgeneralized to represent college and university teachers in general.
It is not easy to set up such studies involving full-time professors. One
suspects it is exceedingly difficult to get a large number of professors
to use a common examination, textbook, and syllabus. Unfortunately many
full-time professors' salaries, tenure status, and promotions are being
determined, at least in part, by these instruments that have too little
empirical research with full-time professors.

There is some evidence that the evaluation of TAs and full-time
professors is significantly different with such instruments, and that
there is greater criterion validity support for the use of student
ratings for full-time, experienced professors than for their use with
TAs. Until there is evidence to support the practice, it is recommended
that full-time professors not be evaluated using instruments on which the
only reliability and validity data have to do with TAs, and vice versa.

There are other possibilities not reflected in the table that could have
contributed to the variation in the reported findings. Some of the studies
present little evidence that student evaluation instruments were
administered under standardized conditions. It is, perhaps, common
knowledge that a lack of anonymity affects ratings and that if an
instructor remains in the classroom, the ratings will be different than
if the instructor leaves. Only a few researchers reported whether the
latter was a part of the procedures. Any number of other variations in
the administration of the rating instruments could have contributed to
differences in results.

Another possible source of the inconsistency concerns the criterion
measures used in the studies. In some studies psychometric properties of
the instruments are not known. In other instances no information was
reported about the scoring of these criterion measures.

Another possible reason for the inconsistent findings is that many of
the studies have not provided adequate control for initial differences in
the sections of the courses. Some researchers adjusted for initial student
ability but were not consistent in the measures used to adjust for ability.
Other researchers used no control for initial ability. It has been
previously mentioned that the sections could be different in other areas,
such as motivation, which could affect evaluation of instruction. Some
researchers used samples in which the students selected courses without
knowing who the instructor was to be, a procedure that is less adequate
than randomization. In only two studies were students randomly assigned
to sections.

Another factor that could have caused results to differ was the time
the evaluations were administered. There is no ideal time to do so. There
is evidence that evaluations administered as early as mid-term will have
different results from those administered at the end of the course. When
administered before the final examination the students have not
experienced an important part of the course that should be a part of the
instructor evaluation. Also, when the ratings are administered at the
time of the final examination, test anxiety might contaminate the
instruction evaluation.

Frey (1973b) indicates that when student ratings are made after the
grades are known the course evaluation might simply reflect the students'
acceptance of their instructors' evaluations of them. Frey also mentions
a "retaliation hypothesis," i.e., the students may tend to mark lower an
instructor who has given them a low grade. Rodin, Frey, and Gessner
(1975) mention the "reward hypothesis," i.e., the students may tend to
mark higher an instructor who has given them a high grade.

Although it is rarely mentioned in any of the research and could not
be accounted for in the summary table, one research design problem
further clouds the issue. Most research projects of this nature depend
upon instructors who will volunteer to be evaluated. It may be that fewer
of the poor instructors volunteer; thus, there is not as much variability
in the teacher rating scores in the samples as might be representative of
the total population of college instructors. In other words, there might
not be a truly representative range of teaching abilities in the various
subjects of many of these studies. Perhaps if the range of instructors
were increased, a greater relationship between student ratings of
instruction and measures of teacher effectiveness would exist.
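
The size of this effect can be illustrated with the standard textbook
formula for direct restriction of range on one variable. The sketch below
is offered only as an illustration under stated assumptions: it is not
taken from any of the studies reviewed, it assumes that selection operates
directly on the rated variable and that the relationship is linear with
equal scatter throughout, and both the function name
`restricted_correlation` and the example values are hypothetical.

```python
from math import sqrt

def restricted_correlation(r_full: float, sd_ratio: float) -> float:
    """Correlation expected in a sample whose standard deviation on the
    selected variable is `sd_ratio` times that of the full population.

    Assumes direct selection on one variable, a linear relationship, and
    equal residual scatter across the range (textbook range-restriction
    formula).
    """
    return (r_full * sd_ratio) / sqrt(1 - r_full**2 + (r_full**2) * (sd_ratio**2))

# Hypothetical values: a population correlation of .60, and volunteer samples
# whose spread of ratings is 100%, 75%, and 50% of the population spread.
for sd_ratio in (1.0, 0.75, 0.5):
    r = restricted_correlation(0.60, sd_ratio)
    print(f"spread = {sd_ratio:.2f} of population: expected r = {r:.2f}")
```

Under these assumptions, halving the spread of ratings in the sample
lowers an underlying correlation of .60 to roughly .35, which is
consistent with the suggestion that a wider range of instructors might
yield stronger relationships.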

However, in defense of these studies, one should note that most of the
actual use of student evaluation instruments depends on the faculty
members volunteering to use them. Most colleges and universities simply
say that some kind of defensible evidence must be produced to support a
teacher's candidacy for promotion, tenure, or merit pay. Many professors
turn to student evaluations for such evidence. Obviously those who know
they are going to get poor student evaluations are not going to use them
if they can possibly avoid it. Thus, even though the volunteer aspect of
the student evaluation-of-instruction studies may be a limitation because
it does not accurately represent the total population of college
teachers, it probably is a strength because the volunteer aspect may
accurately represent the actual present use of such instruments.

As long as a great many unresolved questions remain about academic
freedom and evaluation of teaching, it is likely that a certain amount of
volunteerism will continue with the use of student ratings of instruction.

Student evaluations of instruction have long been used by individual
instructors to help them improve their teaching. In recent years colleges
and universities have had to become acutely aware of the possibility of
litigation with personnel decisions. This concern with litigation has
forced institutions of higher learning to look for evidence to
substantiate personnel decisions. Those seeking the improvement of
teaching and those seeking a more objective data base for decision making
have turned more and more to student evaluations of instruction.

As with all cases of evidence, concern must eventually turn to the
quality of that evidence. Many criticisms have been leveled at student
evaluations of instruction. Some of this criticism has come from
measurement and evaluation experts. Much of it comes, one suspects, from
professors and teaching assistants who do not get very good student
evaluations. In considering these criticisms of student ratings one must
turn to fundamental questions of their legitimacy. No rating procedure
should be used to modify teaching methods or in university governance
unless that procedure has established validity.

Of first consideration in matters of student ratings is the question of
criterion validity, i.e., how well do student ratings hold up when
compared to accepted indicators of good and poor teaching.

It seems quite clear that student ratings of instruction provide good
evidence of the quality of teaching. However, they provide evidence only;
they should not be considered to be more than evidence. They should
never be considered alone as positive proof. It is quite clear that there
is something more to teaching than can ever be totally accounted for by
those who are taught.
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